PROJECT 1

Research Question:

How have temperatures across US states changed over time, and what patterns can be observed in different regions? Is there a correlation between temperature patterns and economic indicators in different regions? What factors may contribute to these changes?

Google Drive link (to view interactive maps): https://drive.google.com/file/d/1cGmB5OXDHyk6p8I7rvhwGk-ReAr9z5jk/view?usp=sharing Note: please download the file from the Google Drive link and then open it to see the interactive maps and graphs.

1.1 Introduction

I will use the "Earth's Surface Temperature Data" dataset to conduct my research.

For this project, I want to find the historical temperature changes for the US over time, with data starting from the second industrial revolution (1900), since this period marks a larger change in temperature. I want to measure which states had the largest temperature changes over time and which had the smallest. I also want to measure the variability of temperature changes over time in a given US state to see if there is a temperature pattern.

Once I find these changes, I will use outside variables for certain indicators in a state, such as population per state, to determine whether these outside economic factors are significant in the temperature findings over time. My reason for the research question is that I want to determine how and why certain states in the US have had larger or smaller temperature changes over time, and which economic indicators might point to the answer. If the findings are economically important, these indicators can be good determinants for predicting a state's temperature over time, and governing policies can be centered around climate issues that may affect the long-term temperature of a given state. Geographic regions are also important to consider, so I will take those into account as well when interpreting my findings.

I will use variables such as US state temperature (and its percent change), time, population percent change, and economic indicators.

Summary Findings

For the temperature change in US states over time, I have so far discovered that temperature increases are larger for states in the northern and western regions. Furthermore, using my summary statistics and mapping over the US states grouped by season, I have found that the winter months in any given state are much more prone to temperature variability and have changed far more in temperature relative to the baseline years of the data. To make an accurate comparison over time, I used the mean temperature from the years 1900-1950 as the "baseline years temperature", and compared the temperatures in other years against this benchmark.

This seasonal trend is more prominent in northern and southern states: in winter, northern states have had a larger and more severe trend toward decreasing temperatures than southern states. Also, in spring and fall, there is a larger temperature-decrease trend in northern states that are closer to bodies of water. This may indicate that large bodies of water are related to larger spring and fall temperature variation in neighbouring states. For an economic interpretation, I also included population percent change per decade for each state and compared it against the percent temperature change per decade in each state. While I could not uncover linear trends because I had too little population-percent-change data, I noticed that some states with large population growth over time do not have large temperature changes, suggesting that population may play a smaller part than geographic location in explaining which states have less temperature variability.

1.2 Data Cleaning/Loading

In [ ]:
# Let's clean and load the US temperature state data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns
plt.rcParams["figure.figsize"] = (20, 20)
# The following installs are only needed on Colab.
! pip install qeds fiona geopandas xgboost gensim folium pyLDAvis descartes
! pip install linearmodels
! pip install scikit-learn
! pip install stargazer
import geopandas as gpd
from shapely.geometry import Point
%matplotlib inline
import statsmodels.api as sm
import plotly.express as px
import plotly.graph_objs as go
from statsmodels.iolib.summary2 import summary_col
from linearmodels.iv import IV2SLS
from sklearn import linear_model
from stargazer.stargazer import Stargazer
from IPython.core.display import HTML
In [2]:
# get the us data from the state data. 
In [86]:
# look at all the countries that are available in the state data.
state = pd.read_csv('GlobalLandTemperaturesByState.csv')
state = state[state['AverageTemperature'].notnull()]
state.head()
Out[86]:
dt AverageTemperature AverageTemperatureUncertainty State Country
0 1855-05-01 25.544 1.171 Acre Brazil
1 1855-06-01 24.228 1.103 Acre Brazil
2 1855-07-01 24.371 1.044 Acre Brazil
3 1855-08-01 25.427 1.073 Acre Brazil
4 1855-09-01 25.675 1.014 Acre Brazil
In [87]:
state = state.loc[state['Country'] == 'United States']
state = state.loc[state['dt'] >= '1900-01-01']
state.head()
Out[87]:
dt AverageTemperature AverageTemperatureUncertainty State Country
9332 1900-01-01 6.286 0.537 Alabama United States
9333 1900-02-01 6.523 0.520 Alabama United States
9334 1900-03-01 11.213 0.372 Alabama United States
9335 1900-04-01 17.103 0.313 Alabama United States
9336 1900-05-01 21.311 0.350 Alabama United States
In [88]:
# let's find the temp of each state over time. first set a datetime index.
state["dt"]=pd.to_datetime(state["dt"],format="%Y-%m-%d")
state.set_index("dt", inplace=True)
state.head()
Out[88]:
AverageTemperature AverageTemperatureUncertainty State Country
dt
1900-01-01 6.286 0.537 Alabama United States
1900-02-01 6.523 0.520 Alabama United States
1900-03-01 11.213 0.372 Alabama United States
1900-04-01 17.103 0.313 Alabama United States
1900-05-01 21.311 0.350 Alabama United States
In [90]:
# now add Month and Year columns.
state['Month'] = state.index.month
state['Year'] = state.index.year
state.head()
state.tail()
Out[90]:
AverageTemperature AverageTemperatureUncertainty State Country Month Year
dt
2013-05-01 10.607 0.208 Wyoming United States 5 2013
2013-06-01 16.267 0.276 Wyoming United States 6 2013
2013-07-01 20.222 0.133 Wyoming United States 7 2013
2013-08-01 19.621 0.217 Wyoming United States 8 2013
2013-09-01 15.811 1.101 Wyoming United States 9 2013
In [91]:
# import decade percent change data of the population - I will try to make a comparison of population change 
# against the temperature change later on. 
pop = pd.read_csv('apportionment.csv') # imported from government census website. 
# link: https://www.census.gov/data/tables/time-series/dec/popchange-data-text.html

pop = pop.loc[pop['Geography Type'] == 'State']
pop = pop.loc[pop['Name'] != 'Puerto Rico']
pop.rename(columns={'Name': 'State'}, inplace=True)
pop
Out[91]:
State Geography Type Year Resident Population Percent Change in Resident Population Resident Population Density Resident Population Density Rank Number of Representatives Change in Number of Representatives Average Apportionment Population Per Representative
0 Alabama State 1910 2,138,093 16.9 42.2 25.0 10.0 1.0 213,809
1 Alaska State 1910 64,356 1.2 0.1 52.0 NaN NaN NaN
2 Arizona State 1910 204,354 66.2 1.8 49.0 NaN NaN NaN
3 Arkansas State 1910 1,574,449 20.0 30.3 30.0 7.0 0.0 224,921
4 California State 1910 2,377,549 60.1 15.3 38.0 11.0 3.0 216,051
... ... ... ... ... ... ... ... ... ... ...
674 Virginia State 2020 8,631,393 7.9 218.6 16.0 11.0 0.0 786,777
675 Washington State 2020 7,705,281 14.6 115.9 24.0 10.0 0.0 771,595
676 West Virginia State 2020 1,793,716 -3.2 74.6 31.0 2.0 -1.0 897,523
677 Wisconsin State 2020 5,893,718 3.6 108.8 27.0 8.0 0.0 737,184
678 Wyoming State 2020 576,851 2.3 5.9 51.0 1.0 0.0 577,719

612 rows × 10 columns

1.3 Summary Statistics

Here we will compute summary statistics to determine which variables we should use and which strategy will best answer the research question.

In [92]:
# Find the temp from the earliest year and 2013 for all the states -> see how the temperature changes. 
state_stats = state.groupby('State')['AverageTemperature'].agg('mean')
df = pd.DataFrame(state_stats)
# this is the average temperature per state starting from year 1900. 
state_stats2 = state.groupby('State')['AverageTemperature'].agg('std')
df2 = pd.DataFrame(state_stats2)
# merge. 
In [93]:
merged_stats = df.merge(df2, on='State')
merged_stats.rename(columns={'AverageTemperature_y': 'std'}, inplace=True)
merged_stats.rename(columns={'AverageTemperature_x': 'mean_temp'}, inplace=True)
merged_stats
Out[93]:
mean_temp std
State
Alabama 17.368358 7.224877
Alaska -4.575507 11.662883
Arizona 15.504199 7.842939
Arkansas 15.919018 8.199573
California 14.401892 6.357927
Colorado 7.156383 8.731812
Connecticut 9.563976 8.637473
Delaware 12.338136 8.436579
District Of Columbia 12.323663 8.853603
Florida 21.757914 4.736399
Georgia (State) 17.798509 6.863318
Hawaii 22.500436 1.406836
Idaho 5.619539 8.598028
Illinois 11.272790 9.899922
Indiana 11.252629 9.602761
Iowa 9.142975 10.982083
Kansas 12.582085 9.868345
Kentucky 13.281130 8.698185
Louisiana 19.325408 6.660028
Maine 4.920197 10.056386
Maryland 12.537802 8.512987
Massachusetts 8.197084 9.148826
Michigan 7.014218 9.786613
Minnesota 5.040005 11.988794
Mississippi 17.837702 7.261057
Missouri 12.743931 9.560668
Montana 5.364379 9.581042
Nebraska 9.445640 10.307543
Nevada 9.937901 8.355840
New Hampshire 6.195500 9.884228
New Jersey 11.099996 8.821496
New Mexico 12.015021 7.888679
New York 7.704903 9.686519
North Carolina 15.079593 7.518584
North Dakota 4.679619 12.202482
Ohio 10.682247 9.296196
Oklahoma 15.638133 8.905261
Oregon 8.357860 6.664890
Pennsylvania 9.547486 9.102623
Rhode Island 9.520399 8.606931
South Carolina 17.297130 7.131873
South Dakota 7.427638 11.209119
Tennessee 14.376760 8.139083
Texas 18.333853 7.402622
Utah 8.616795 8.934904
Vermont 5.974133 10.012610
Virginia 13.142423 8.132487
Washington 7.871788 6.940066
West Virginia 11.289492 8.472790
Wisconsin 6.446685 10.891783
Wyoming 5.321200 9.152611
In [63]:
merged_stats.reset_index(inplace=True)
In [64]:
plot = merged_stats.plot(x='State', y=['mean_temp', 'std'], kind='bar',
                         figsize=(15, 10), ylabel='Temperature (in Degrees Celsius)')

This plot displays the average temperature and the standard deviation of the average temperature over time, using 1900-2013 historical monthly averages. From this we can anticipate how certain states may behave in terms of temperature changes over the course of our analysis. For example, we can see that certain states have a lower standard deviation, hence we may observe that these states have lower temperature variability in any given year, while states with a high standard deviation may have more variation in temperature changes.

Using this analysis, we can try to determine if these results from the graph are a byproduct of the geographic location of the state or whether there are other factors that can help predict temperature changes/variability.

Ok, now we have state average temperature and standard deviation summary statistics. Now I want to see if there is a regional pattern among states for the temperature. However, geographically speaking, this is going to be biased, since certain states in certain regions are 'hotter' than others because of their location (closer to or further from the equator). Therefore, classifying states by average temperature may not be that meaningful for the research question. To solve this problem, I can take the average temperature percent change of each state over time, which will better reveal prominent changes in certain regions and further aid my hypothesis.

However, from this dataframe we can analyze the standard deviation (variation) of each state's average temperature, and this may tell us something about temperature variation per state. After all, climate change is defined not only through a temperature increase over time, but also through higher temperature variability over time. Maybe there are certain states in certain regions that experience more variability in temperature changes over time.

I will also look at seasonality, since for certain states (such as Alaska or Hawaii) the temperature changes might be more variable or systematically higher/lower on average due to location, so taking the aggregate temperature may not be the most informative way to analyze long-term trends.

So overall, for a better look at the research question, I think it's more meaningful to analyze the temperature change, standard deviation trends, and any other potential variables over time, instead of taking the overall average. We can get the summary stats for this next.

In this section, we will create a dataframe for looking at seasonal trends of state temperature over time.
In [94]:
# add a Season variable to take seasonality into consideration.
season_map = {12: "Winter", 1: "Winter", 2: "Winter",
              3: "Spring", 4: "Spring", 5: "Spring",
              6: "Summer", 7: "Summer", 8: "Summer",
              9: "Fall", 10: "Fall", 11: "Fall"}
state["Season"] = state["Month"].map(season_map)
state
Out[94]:
AverageTemperature AverageTemperatureUncertainty State Country Month Year Season
dt
1900-01-01 6.286 0.537 Alabama United States 1 1900 Winter
1900-02-01 6.523 0.520 Alabama United States 2 1900 Winter
1900-03-01 11.213 0.372 Alabama United States 3 1900 Spring
1900-04-01 17.103 0.313 Alabama United States 4 1900 Spring
1900-05-01 21.311 0.350 Alabama United States 5 1900 Spring
... ... ... ... ... ... ... ...
2013-05-01 10.607 0.208 Wyoming United States 5 2013 Spring
2013-06-01 16.267 0.276 Wyoming United States 6 2013 Summer
2013-07-01 20.222 0.133 Wyoming United States 7 2013 Summer
2013-08-01 19.621 0.217 Wyoming United States 8 2013 Summer
2013-09-01 15.811 1.101 Wyoming United States 9 2013 Fall

69613 rows × 7 columns

In [95]:
baseline_years = range(1900, 1951) # baseline years, 1900-1950 inclusive.
baseline_data = state[state['Year'].isin(baseline_years)]
In [96]:
# Group data by state and season
grouped_data = baseline_data.groupby(['State', 'Season'])['AverageTemperature'].mean().reset_index()
grouped_data.columns = ['State', 'Season', 'BaselineAvgTemperature']

# Merge baseline data with original data to get temperature change
merged_data = pd.merge(state, grouped_data, on=['State', 'Season'])
merged_data['TemperatureChange'] = ((merged_data['AverageTemperature'] - merged_data['BaselineAvgTemperature'])/merged_data['BaselineAvgTemperature'])*100

# Group data by state and season and calculate average temperature change and std deviation
grouped_data = merged_data.groupby(['State', 'Season']).agg({'TemperatureChange': ['mean', 'std']}).reset_index()
grouped_data.columns = ['State', 'Season', 'AvgTemperatureChange', 'StdTemperatureChange']

# Pivot the data to get the desired output format
output_data = grouped_data.pivot(index='State', columns='Season')
output_data.columns = [col[0] + '_' + col[1] for col in output_data.columns.values]
output_data.reset_index(inplace=True)

# Rename the columns to remove 'TemperatureChange' from the column names
output_data.columns = ['State'] + [col.replace('TemperatureChange', '') + 'Percent Change' for col in output_data.columns[1:]]

# Display the output data
series = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN','MS', 'MO', 'MT', 'NE', 'NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY']
output_data['State Abbreviation'] = series
output_data.head()
Out[96]:
State Avg_FallPercent Change Avg_SpringPercent Change Avg_SummerPercent Change Avg_WinterPercent Change Std_FallPercent Change Std_SpringPercent Change Std_SummerPercent Change Std_WinterPercent Change State Abbreviation
0 Alabama 0.524667 1.207958 0.937016 -0.630510 27.929518 22.554549 4.194436 28.901118 AL
1 Alaska -1.654958 -3.666642 2.002998 -1.936255 160.516877 124.157545 13.571873 20.131822 AK
2 Arizona 1.308517 1.192448 1.130993 3.066347 33.406190 26.873746 6.067800 33.178413 AZ
3 Arkansas 0.393027 1.642405 0.990567 0.658951 32.052797 27.365292 5.894152 46.013985 AR
4 California 1.900165 0.985690 1.343946 3.938808 29.331947 23.579841 8.125712 25.076841 CA

Ok, so now we have a dataframe that depicts the seasonal percent change of temperature and its standard deviation in each state.

Percent change against the baseline years is used because it normalizes each state's temperature change by its own baseline level, so the analysis focuses on relative change rather than the absolute size of the temperature.

Instead of comparing the temperature change in a given year to the single base year 1900, it is more accurate for the research question to use baseline years 1900-1950: we find the average temperature over the baseline years and then compute the percent change between this baseline temperature and a given temperature in a given state and season. This approach makes sense because the climate in any single year can be variable, and many factors can alter it in ways that may not carry over to later years (such as forest fires, unseasonably cold or warm ocean or land temperatures, or other weather anomalies). Averaging the temperature over multiple years smooths out these short-term fluctuations, providing a more stable and representative picture of the long-term trends in temperature. This baseline can then be used to assess changes in temperature over time and help inform decisions related to climate policy and adaptation.
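Concretely, the percent change computed in the code above can be written as:

$$ \text{TemperatureChange}_{s,m} = \frac{T_{s,m} - \bar{T}^{\,1900\text{--}1950}_{s,\,season(m)}}{\bar{T}^{\,1900\text{--}1950}_{s,\,season(m)}} \times 100 $$

where $T_{s,m}$ is the monthly average temperature for state $s$ in month $m$, and $\bar{T}^{\,1900\text{--}1950}_{s,\,season(m)}$ is that state's 1900-1950 mean temperature for the season containing month $m$.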

Each state was also grouped by seasons to find the percent change in a given season. It is important to take into account seasons because temperature patterns vary depending on the season. For example, during the winter months, the temperature in a particular state or region may be significantly colder than during the summer months. By grouping the temperature data by season and calculating the average temperature change and standard deviation for each season, we can get a better understanding of how temperatures are changing over time for each season, rather than just looking at the overall temperature change. This can be important for understanding how climate change is affecting different regions in different seasons, and why this might be the case. This dataframe is packed with information so let's dig in!

In the winters, there is a lot more temperature variability compared to the other seasons. This is an interesting find, although not too surprising. One reason why it may be easier to see climate trends in states during winter months is that the temperature variations during this season tend to be larger than during other seasons, especially in northern regions of the US. This means that any changes or anomalies in temperature would be more apparent during the winter months.
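As a quick check of this claim, the seasonal standard-deviation columns of `output_data` can be compared directly. A minimal sketch, using a tiny stand-in frame with the Alabama and Alaska values from the table above (the real `output_data` is built earlier):

```python
import pandas as pd

# Illustrative stand-in for output_data (values copied from the Out[96] excerpt).
df = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Std_FallPercent Change":   [27.9, 160.5],
    "Std_SpringPercent Change": [22.6, 124.2],
    "Std_SummerPercent Change": [4.2, 13.6],
    "Std_WinterPercent Change": [28.9, 20.1],
})

std_cols = [c for c in df.columns if c.startswith("Std_")]
# For each state, find the season whose percent-change std is largest.
df["MostVariableSeason"] = (
    df[std_cols].idxmax(axis=1)
    .str.replace("Std_", "")
    .str.replace("Percent Change", "")
)
print(df[["State", "MostVariableSeason"]])
```

Applied to the full `output_data`, the same `idxmax` pattern reports each state's most variable season; note that even in this excerpt Alaska's fall, not winter, is the most variable, so the winter pattern is a tendency rather than a universal rule.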

However, it is also important to note that climate change impacts are not limited to changes in temperature alone, and other factors such as precipitation, extreme weather events, and sea level rise can also have significant impacts on different regions. Therefore, it would be important to analyze and consider these factors as well in order to fully understand the climate patterns and their potential impacts on different regions.

Creating multiple regressions per state to analyze seasonal/annual components

I predict there is a linear trend relationship between my independent and dependent variables. According to this graph:

In [15]:
plt.style.use('seaborn')
state.plot(x='Year', y='AverageTemperature', kind='scatter')
plt.show()

We can see there is a linear relationship between year and temperature change over time. This makes sense, since temperature increases very slowly and can be predicted fairly well in the near future using an OLS estimate. To continue, in order to assess how important seasonal trends are in predicting temperature for a given state, a multiple regression model can be helpful, fit separately for each state.
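To illustrate what such a trend fit looks like, here is a minimal sketch on synthetic data with a known warming slope of 0.01 degrees Celsius per year (the regressions below use the real `state` data instead):

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1900, 2014)
# Synthetic annual series: slow 0.01 degrees-per-year warming trend plus noise.
temps = 10.0 + 0.01 * (years - 1900) + rng.normal(0, 0.3, size=years.size)

# Least-squares line fit; equivalent to sm.OLS(temps, sm.add_constant(years)).fit().
slope, intercept = np.polyfit(years, temps, deg=1)
print(f"estimated trend: {slope:.4f} degrees C per year")
```

The recovered slope should land close to the true 0.01 degrees per year, which is the kind of small but steady annual coefficient the state-level regressions below estimate.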

The independent variables will be the seasons, using dummy variables for each season to encode the categorical data. The dependent variable will be the temperature in degrees Celsius. We want to find out whether certain seasons are better predictors of the temperature in a given state, and why this might be the case. This analysis will give some concrete statistical backing to shed light on the question we are trying to answer.

In [98]:
merged_data # use this data instead of <output_data> for the regressions, since it has many more observations.
# <output_data> is for the mapping analysis only.
Out[98]:
AverageTemperature AverageTemperatureUncertainty State Country Month Year Season BaselineAvgTemperature TemperatureChange
0 6.286 0.537 Alabama United States 1 1900 Winter 8.210451 -23.439041
1 6.523 0.520 Alabama United States 2 1900 Winter 8.210451 -20.552476
2 7.733 0.539 Alabama United States 12 1900 Winter 8.210451 -5.815161
3 7.603 0.511 Alabama United States 1 1901 Winter 8.210451 -7.398509
4 6.040 0.637 Alabama United States 2 1901 Winter 8.210451 -26.435222
... ... ... ... ... ... ... ... ... ...
69608 -1.407 0.222 Wyoming United States 11 2011 Fall 5.814706 -124.197269
69609 14.491 0.221 Wyoming United States 9 2012 Fall 5.814706 149.212949
69610 5.455 0.264 Wyoming United States 10 2012 Fall 5.814706 -6.186141
69611 1.884 0.139 Wyoming United States 11 2012 Fall 5.814706 -67.599393
69612 15.811 1.101 Wyoming United States 9 2013 Fall 5.814706 171.914011

69613 rows × 9 columns

In [99]:
dummy_merged = merged_data.copy(deep=True)
In [100]:
dummy_merged['Winter'] = (dummy_merged['Season'] == 'Winter').astype(int)
dummy_merged['Spring'] = (dummy_merged['Season'] == 'Spring').astype(int)
dummy_merged['Fall'] = (dummy_merged['Season'] == 'Fall').astype(int)
dummy_merged['Summer'] = (dummy_merged['Season'] == 'Summer').astype(int)
In [101]:
grouped3 = dummy_merged.groupby('State')
In [102]:
lst1 = []
for groups in grouped3.groups:
    state2 = grouped3.get_group(groups)
    # Note: all four season dummies plus a constant are perfectly collinear;
    # statsmodels still fits via the pseudoinverse, but dropping one season as a
    # baseline would make the coefficients uniquely identified.
    X = state2[['Winter', 'Fall', 'Spring', 'Summer', 'Year']]
    Y = state2[['AverageTemperature']]

    X = sm.add_constant(X)
    model = sm.OLS(Y, X).fit()
    lst1.append(model)
In [112]:
stargazer_2 = Stargazer(lst1)
lst = [1] * len(lst1)              # one column per state model
state_lst = list(grouped3.groups)  # state names, in group order

stargazer_2.custom_columns(state_lst, lst)
stargazer_2.show_model_numbers(False)
stargazer_2.show_degrees_of_freedom(False)


HTML(stargazer_2.render_html())
Out[112]:
Dependent variable: AverageTemperature
(Stargazer renders one regression column per state, 51 models in total. Excerpt for Alaska, the state interpreted below:)

            coef        (std err)
Fall        -5.665***   (1.833)
Spring      -6.630***   (1.835)
Summer       9.384***   (1.835)
Winter     -19.416***   (1.834)
Year         0.012**    (0.005)
const      -22.328***   (7.260)
Observations         1,364
R2                   0.768
Adjusted R2          0.768
Residual Std. Error  5.621
F Statistic          1127.383***
Note: *p<0.1; **p<0.05; ***p<0.01

Since we used regression analysis on each state, our model can be introduced like this:

$$ {AverageTemperature}_i = \beta_0 + \beta_1 {year}_i + \beta_2 {summer}_i + \beta_3 {spring}_i + \beta_4 {winter}_i + \beta_5 {fall}_i + u_i $$

where:

  • $ \beta_0 $ is the intercept of the linear trend line on the y-axis
  • $ \beta_1 $ is the coefficient on the year variable, giving how much the temperature increases (in Celsius) annually, on average, in that state
  • $ \beta_{2..5} $ are the coefficients on the season dummies (each dummy is 1 in its season and 0 otherwise); each gives, on average, how much the temperature (in Celsius) differs in that season, after accounting for the annual trend and controlling for the other seasons
  • $ u_i $ is a random error term (deviations of observations from the linear trend due to factors not included in the model)
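As a worked check of how these coefficients combine, plugging the rounded Alaska estimates from the table above into the model (Winter dummy equal to 1, the other season dummies 0) gives a fitted average winter temperature in the year 2000 of roughly

$$ \hat{T} = -22.328 + 0.012 \times 2000 - 19.416 \approx -17.7 \ \text{degrees Celsius,} $$

an approximation only, since the coefficients are rounded.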

Here, each model corresponds to a regression run for one state. Grouping the regressions by season and state is the most reasonable approach given the question we are trying to answer. We want to isolate the effects of seasonality for any given state and see how the results differ across states, i.e. which states may have few or no statistically significant dummy variables, and how their geographic location might explain this.

I will do an interpretation of the results for Alaska for example, and then point out some general trends:

The multiple regression model shows that the season variables (Winter, Fall, Spring, and Summer) and the year variable have a significant impact on the average temperature in Alaska.

The R-squared value of 0.768 indicates that the model explains 76.8% of the variance in the dependent variable (Average Temperature). This means that the model is a reasonably good fit for the data.

The coefficients for the season variables show how much the average temperature in Alaska changes in each season, controlling for the other seasons and the year. The Winter coefficient of -19.416 indicates that, on average, after controlling for the other seasonal trends and yearly temperature changes, the temperature in Alaska during winter is 19.4 degrees Celsius lower than the average temperature. Similarly, the Fall coefficient of -5.665 indicates that the temperature during the fall season is 5.665 degrees Celsius lower than the average temperature.

The coefficient for the Year variable is 0.012, which means that for every one-year increase in the Year variable, the temperature in Alaska increases by an average of 0.012 degrees Celsius, controlling for seasonality.

The p-values show that all of the seasonal coefficients are significant at the 1% level, and the annual coefficient is significant at the 5% level. This suggests that the explanatory variables are statistically significant predictors of the average temperature in Alaska over time.

Intuitively, the coefficients make sense. The other seasons (Winter, Fall, and Spring) have a negative impact on the average temperature in Alaska, while Summer has a positive impact (as can be seen from the signs of the coefficients). This could be because Alaska is a colder state with relatively mild summers, so the lower and more drastic temperatures in the other seasons have a greater impact on the temperature changes over time. It could also be due to other factors like precipitation, cloud cover, and wind patterns, which vary between seasons and affect the average temperature differently.

To assess my regression results, I will point out some trends that generalize across all the states. There is actually an interesting trend in these results that may further our understanding of the research question.

It's worth noting that the coefficient for the year variable is statistically significant for all states, indicating that temperature has been increasing over time, on average, across all states, controlling for seasonality! So even when we account for seasonal temperature changes in our model, time plays a significant role in predicting temperature in each state.

We can see that for most states, the spring and fall dummies are not statistically significant.

Overall, most states have either the winter or summer dummy as a statistically significant predictor of the average temperature. It makes sense intuitively that the winter and summer dummies are more likely to be significant predictors of average temperature compared to spring and fall dummies. Winter and summer are the two most extreme seasons in terms of temperature. Therefore, it is more likely that the winter and summer dummies will have a stronger relationship with the average temperature compared to the more mild seasons of spring and fall. Additionally, the winter and summer seasons tend to have more distinct and consistent weather patterns, which can also contribute to their significance as predictors.

Next, I will also run a regression for each season! This aggregates the states' results together so we only look at the seasonal changes overall. In this regression the independent variables are: the baseline temperature for that season in that given year in that state, the temperature change against the baseline in that month, and the year. The dependent variable is the average temperature in that given month in that season.

My reasoning for running this regression is that I want to isolate seasonal temperature effects by 1) first controlling for that season's average temperature as a whole, then 2) looking at the temperature change (fluctuations from that average) against that baseline, and finally assessing how significant these two are in determining that season's overall temperature. This regression will make clear whether the temperature fluctuations are a good indicator of average temperature, meaning that they are somewhat repetitive and have a clear pattern in any given state on a seasonal basis when regressed against average temperature.

The model is as such:

$$ {AverageTemperature}_i = \beta_0 + \beta_1 {year}_i + \beta_2 {BaselineTemp}_i + \beta_3 {TemperatureChange}_i + u_i $$

where:

  • $ \beta_0 $ is the intercept of the linear trend line on the y-axis
  • $ \beta_1 $ is the coefficient on the year variable, giving how much temperature increases (in Celsius) annually, on average, in that season
  • $ \beta_2 $ is the coefficient on the baseline average temperature (in degrees Celsius) for that season in a given year in a given state, aggregated for each state
  • $ \beta_3 $ is the coefficient on the temperature percent change in that season for a given month in a given state, calculated as ((monthly average temperature in a state in a given year for a given season − baseline average temperature for that season in that year for that state) / baseline average temperature for that season in that year for that state) × 100, in percent.
  • $ u_i $ is a random error term (deviations of observations from the linear trend due to factors not included in the model)
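
The TemperatureChange definition above boils down to a simple percent-change formula; the numbers here are hypothetical illustration values:

```python
# Hypothetical numbers illustrating the TemperatureChange definition above:
monthly_avg = 21.3    # monthly average temperature for the season (°C)
baseline_avg = 20.0   # baseline average temperature for that season (°C)

temp_change = (monthly_avg - baseline_avg) / baseline_avg * 100
print(round(temp_change, 2))  # 6.5
```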
In [126]:
season_group = dummy_merged.groupby('Season')
In [127]:
lst_3 = []
for group in season_group.groups:
    season = season_group.get_group(group)
    X = season[['Year', 'BaselineAvgTemperature', 'TemperatureChange']]
    Y = season[['AverageTemperature']]
    X = sm.add_constant(X)
    model = sm.OLS(Y, X).fit()
    lst_3.append(model)
    predictions = model.predict(X)
    
In [128]:
stargazer_3 = Stargazer(lst_3)
season_lst = []
    
for group in season_group.groups:
    season_lst.append(group)

stargazer_3.custom_columns(season_lst, [1,1,1,1])
stargazer_3.show_model_numbers(False)
stargazer_3.show_degrees_of_freedom(False)


HTML(stargazer_3.render_html())
Out[128]:
Dependent variable: AverageTemperature

                         Fall          Spring        Summer         Winter
BaselineAvgTemperature   1.004***      1.008***      1.012***       0.981***
                         (0.005)       (0.004)       (0.001)        (0.003)
TemperatureChange        0.071***      0.054***      0.195***       0.000
                         (0.000)       (0.000)       (0.000)        (0.000)
Year                     0.003***      0.004***      0.001***       0.011***
                         (0.001)       (0.001)       (0.000)        (0.001)
const                    -5.181***     -6.977***     -1.558***      -21.012***
                         (1.533)       (1.272)       (0.137)        (1.192)
Observations             17,338        17,442        17,442         17,391
R2                       0.787         0.842         0.994          0.876
Adjusted R2              0.787         0.842         0.994          0.876
Residual Std. Error      3.370         2.820         0.297          2.637
F Statistic              21299.869***  31038.304***  991673.491***  40867.928***
Note: *p<0.1; **p<0.05; ***p<0.01

As we can observe, most of the coefficients are statistically significant at the 99 percent confidence level. This means that for each season, the baseline temperature is a strong predictor of average temperature. This is not a surprising result, since the baseline is calculated from the average temperature aggregations; the main reason for including this variable is to help interpret the temperature change variable.

As we can see, temperature change is statistically significant for all seasons except winter. So we can say that the fluctuations around a given mean temperature (baseline value) for summer, spring, and fall are statistically significant predictors of temperature activity. In other words, the fluctuations need not be random: there can be some predictability in the magnitude of these changes around the baseline values. For winter, however, the temperature fluctuations are not significant predictors of temperature at all. The coefficient is 0, meaning these fluctuations do not help determine the average temperature, while the baseline values remain statistically significant. Intuitively, this means that winter fluctuations are too volatile to be good predictors of temperature. A downside of this regression is that we cannot see this breakdown on a state-by-state basis (the regression would be too large and we would have to create dummies for each state), so we don't know whether a few states are skewing the results (e.g. Alaska or North Dakota).
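The state-dummy variant mentioned above would not actually have to be built by hand: the statsmodels formula API can absorb a categorical State column via `C()`, adding a per-state intercept. A hedged sketch on hypothetical data (not the project's dataset):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hedged sketch of state fixed effects via C(State) in the formula API.
# All values below are hypothetical illustration data.
df = pd.DataFrame({
    'AverageTemperature': [2.0, 2.5, 12.0, 12.6, -10.0, -9.2, 15.0, 15.5],
    'Year': [1950, 2000] * 4,
    'BaselineAvgTemperature': [2.0, 2.2, 12.1, 12.5, -9.9, -9.6, 15.0, 15.1],
    'TemperatureChange': [-4.8, 19.0, -1.6, 3.3, 2.0, -6.1, -0.7, 2.6],
    'State': ['Maine', 'Maine', 'Ohio', 'Ohio',
              'Alaska', 'Alaska', 'Texas', 'Texas'],
})

# C(State) expands into one dummy per state (minus a reference category)
fit = smf.ols('AverageTemperature ~ Year + BaselineAvgTemperature'
              ' + TemperatureChange + C(State)', data=df).fit()
print(fit.params.index.tolist())
```

On the full dataset this would add roughly 50 dummy columns, which is unwieldy to report but feasible to estimate, and it would reveal whether a few states are driving the pooled seasonal results.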

Next, we will use the population data to complement our findings and further our economic analysis. We will use population percent change per decade in US states, starting from 1910. The hypothesis is that higher population density in a state will be related to the temperature changes in that state. Since higher population can lead to larger temperature changes for various economic reasons, there might be a positive relationship between the economic variable and the geographic variable. Let's test this!¶

In [23]:
# merge with ouput_data for comparison
pop2 = pop.merge(output_data, on='State')
pop2.head()
Out[23]:
State Geography Type Year Resident Population Percent Change in Resident Population Resident Population Density Resident Population Density Rank Number of Representatives Change in Number of Representatives Average Apportionment Population Per Representative Avg_FallPercent Change Avg_SpringPercent Change Avg_SummerPercent Change Avg_WinterPercent Change Std_FallPercent Change Std_SpringPercent Change Std_SummerPercent Change Std_WinterPercent Change State Abbreviation
0 Alabama State 1910 2,138,093 16.9 42.2 25.0 10.0 1.0 213,809 0.524667 1.207958 0.937016 -0.63051 27.929518 22.554549 4.194436 28.901118 AL
1 Alabama State 1920 2,348,174 9.8 46.4 25.0 10.0 0.0 234,817 0.524667 1.207958 0.937016 -0.63051 27.929518 22.554549 4.194436 28.901118 AL
2 Alabama State 1930 2,646,248 12.7 52.3 24.0 9.0 -1.0 294,027 0.524667 1.207958 0.937016 -0.63051 27.929518 22.554549 4.194436 28.901118 AL
3 Alabama State 1940 2,832,961 7.1 55.9 23.0 9.0 0.0 314,773 0.524667 1.207958 0.937016 -0.63051 27.929518 22.554549 4.194436 28.901118 AL
4 Alabama State 1950 3,061,743 8.1 60.5 24.0 9.0 0.0 340,194 0.524667 1.207958 0.937016 -0.63051 27.929518 22.554549 4.194436 28.901118 AL

As you may have noticed, the data in pop is time series data, with each observation being the percent change in a given decade in a given state. To make a meaningful comparison against population change, it would be unwise to compare cross-sectional temperature percent change for each season against the decade-by-decade percent change in population. So, in the next part, I will group the 'state' data again by decade and state, instead of season and state, to find the percent change per decade for each state using all observations. This can be better compared against the population change data. In the mapping and figures section, I will attempt to discover a relationship among these variables for an economic interpretation of my research question.

In [24]:
# First, convert the 'Year' column to a datetime object in state
state['Year'] = pd.to_datetime(state['Year'], format='%Y')

# Then, create a new column for the decade using the 'Year' column
state['Decade'] = state['Year'].apply(lambda x: int(x.year/10)*10)
state
Out[24]:
AverageTemperature AverageTemperatureUncertainty State Country Month Year Season Decade
dt
1900-01-01 6.286 0.537 Alabama United States 1 1900-01-01 Winter 1900
1900-02-01 6.523 0.520 Alabama United States 2 1900-01-01 Winter 1900
1900-03-01 11.213 0.372 Alabama United States 3 1900-01-01 Spring 1900
1900-04-01 17.103 0.313 Alabama United States 4 1900-01-01 Spring 1900
1900-05-01 21.311 0.350 Alabama United States 5 1900-01-01 Spring 1900
... ... ... ... ... ... ... ... ...
2013-05-01 10.607 0.208 Wyoming United States 5 2013-01-01 Spring 2010
2013-06-01 16.267 0.276 Wyoming United States 6 2013-01-01 Summer 2010
2013-07-01 20.222 0.133 Wyoming United States 7 2013-01-01 Summer 2010
2013-08-01 19.621 0.217 Wyoming United States 8 2013-01-01 Summer 2010
2013-09-01 15.811 1.101 Wyoming United States 9 2013-01-01 Fall 2010

69613 rows × 8 columns
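
As an aside, the per-row lambda above can be replaced by an equivalent vectorized expression; the dates below are hypothetical examples, not rows from the dataset:

```python
import pandas as pd

# Equivalent, vectorized way to derive the decade (same result as the
# row-wise lambda above): integer-divide the year by 10, then multiply back.
years = pd.to_datetime(['1907-06-01', '2013-09-01'])
decades = years.year // 10 * 10
print(list(decades))  # [1900, 2010]
```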

In [ ]:
# Group the data by state and decade, and calculate the average temperature for each group
grouped_data = state.groupby(['State', pd.Grouper(key='Year', freq='10Y')])[['AverageTemperature', 'Decade']].mean().reset_index()
In [ ]:
baseline_years = range(1900, 1951)
baseline_data = grouped_data[grouped_data['Decade'].isin(baseline_years)]
baseline_data
In [27]:
merged_data = pd.merge(grouped_data, baseline_data, on=['State'])
merged_data['TemperatureChange'] = ((merged_data['AverageTemperature_x'] - merged_data['AverageTemperature_y'])/merged_data['AverageTemperature_y'])*100
In [ ]:
# Group the data by state and year_x again, and calculate the average temperature change and std deviation for each group
final_data = merged_data.groupby(['State', 'Year_x']).agg({'TemperatureChange': ['mean', 'std']}).reset_index()
final_data.columns = ['State', 'Decade', 'AvgTemperatureChange', 'StdTemperatureChange']

# Pivot the data to get the desired output format
output_data2 = final_data.pivot(index='State', columns='Decade')
output_data2.columns = [col[0] + '_' + str(col[1].year) for col in output_data2.columns.values]
output_data2.reset_index(inplace=True)

# Rename the columns to remove 'TemperatureChange' from the column names
output_data2.columns = ['State'] + [col.replace('TemperatureChange', '') + 'Percent Change' for col in output_data2.columns[1:]]

# Display the output data
output_data2.head()
In [29]:
output_data2['State Abbreviation'] = series
In [30]:
output_data2.head()
Out[30]:
State Avg_1900Percent Change Avg_1910Percent Change Avg_1920Percent Change Avg_1930Percent Change Avg_1940Percent Change Avg_1950Percent Change Avg_1960Percent Change Avg_1970Percent Change Avg_1980Percent Change ... Std_1940Percent Change Std_1950Percent Change Std_1960Percent Change Std_1970Percent Change Std_1980Percent Change Std_1990Percent Change Std_2000Percent Change Std_2010Percent Change Std_2020Percent Change State Abbreviation
0 Alabama -0.872636 -2.127913 -0.963768 1.110638 1.512406 1.460351 1.509888 -1.512114 -0.015592 ... 1.569146 1.568341 1.569107 1.522394 1.545527 1.565402 1.585774 1.601200 1.668281 AL
1 Alaska 6.691152 14.107774 1.858280 -8.971326 -4.865070 -5.204364 4.168384 3.034927 1.893522 ... 7.947510 7.919166 8.702159 8.607470 8.512118 7.240847 6.908739 5.885135 6.785129 AK
2 Arizona 2.548125 -0.155804 -3.228203 -0.281522 1.585259 -0.270340 1.284533 -0.502561 -0.636477 ... 2.027394 1.990361 2.021392 1.985726 1.983054 2.035105 2.064589 2.104258 2.169206 AZ
3 Arkansas 0.750236 -2.639977 -1.936388 0.806073 2.339822 0.862966 1.753084 -0.803799 -0.427731 ... 1.963481 1.935146 1.952224 1.903168 1.910383 1.938636 1.968906 1.992814 2.114963 AR
4 California 2.190976 -0.532920 -3.301403 -0.240642 2.572075 -0.458355 1.316666 0.655140 0.537513 ... 2.204615 2.139481 2.177632 2.163414 2.160885 2.212351 2.239585 2.270909 2.325179 CA

5 rows × 28 columns

Ok, we can see now that we have columns that correspond to the temperature change in a given decade in a state. I basically used the same strategy as 'output_data' to find the baseline years' average temperature per decade and compute the change of that against all other decades in a given state. I did this for all the baseline decades, comparing against each decade until 2010. I can plot figures to find patterns among these variables.

1.4 Plots/Histograms/Figures¶

For the part 2) visualization section, I have included that in this part instead because I think it would be more meaningful for my data plots to be side by side.¶

This is one of the most important parts of discovering patterns over time to figure out seasonal trends. Let's continue.

First, let's create hovermaps to visualize our {output_data} dataframe and check for any seasonal trends/outlier states. I will create four maps, one per season, to visualize these trends for the average temperature percent change columns, as well as the standard deviation of the percent change columns per season.

In [31]:
# just temp
for season in ['Fall', 'Winter', 'Summer', 'Spring']:
    fig = px.scatter(output_data, x='State', y=f'Avg_{season}Percent Change', color=f'Avg_{season}Percent Change',
                     title=f'Percent Change in Temperature for {season}', hover_name='State', hover_data=[f'Avg_{season}Percent Change'])
    fig.show()

We can see that temperature changes the most during the winter months compared to any other season. Also note an interesting trend: the percent change of the temperature during the summer months has historically remained almost entirely above 0%. Meaning that, compared to the baseline years, summer temperatures have increased. The winters are the most volatile and have the highest standard deviation (graphs below), and overall, the temperature changes in fall and spring have also remained above 0% (except Alaska, due to its geographic location).

In [32]:
# just std
for season in ['Fall', 'Winter', 'Summer', 'Spring']:
    fig = px.scatter(output_data, x='State', y=f'Std_{season}Percent Change', color=f'Std_{season}Percent Change',
                     title=f' Std. of Percent Change in Temperature for {season}', hover_name='State', hover_data=[f'Std_{season}Percent Change'])
    fig.show()

The standard deviation is lowest during the summer months and highest during winter. States such as Maine, Minnesota, Alaska, New Hampshire, Vermont, and North Dakota have been the most volatile in any given season. Note that these are also the states that had a standard deviation higher than their mean temperature in the summary statistics section.

In [33]:
pop.drop(['Geography Type', 'Number of Representatives', 'Change in Number of Representatives', 'Average Apportionment Population Per Representative'], axis=1, inplace = True)
In [34]:
pop.tail() # we basically want to merge this data with the output_data2 to see trends among the temperature change 
# variable per decade and the percent change in population per decade. However, here, merging is a bit difficult
# because we have columns of decade data per state in the output_data2 df, and we want to merge on state and decade
Out[34]:
State Year Resident Population Percent Change in Resident Population Resident Population Density Resident Population Density Rank
674 Virginia 2020 8,631,393 7.9 218.6 16.0
675 Washington 2020 7,705,281 14.6 115.9 24.0
676 West Virginia 2020 1,793,716 -3.2 74.6 31.0
677 Wisconsin 2020 5,893,718 3.6 108.8 27.0
678 Wyoming 2020 576,851 2.3 5.9 51.0
In [35]:
output_data2.head()
Out[35]:
State Avg_1900Percent Change Avg_1910Percent Change Avg_1920Percent Change Avg_1930Percent Change Avg_1940Percent Change Avg_1950Percent Change Avg_1960Percent Change Avg_1970Percent Change Avg_1980Percent Change ... Std_1940Percent Change Std_1950Percent Change Std_1960Percent Change Std_1970Percent Change Std_1980Percent Change Std_1990Percent Change Std_2000Percent Change Std_2010Percent Change Std_2020Percent Change State Abbreviation
0 Alabama -0.872636 -2.127913 -0.963768 1.110638 1.512406 1.460351 1.509888 -1.512114 -0.015592 ... 1.569146 1.568341 1.569107 1.522394 1.545527 1.565402 1.585774 1.601200 1.668281 AL
1 Alaska 6.691152 14.107774 1.858280 -8.971326 -4.865070 -5.204364 4.168384 3.034927 1.893522 ... 7.947510 7.919166 8.702159 8.607470 8.512118 7.240847 6.908739 5.885135 6.785129 AK
2 Arizona 2.548125 -0.155804 -3.228203 -0.281522 1.585259 -0.270340 1.284533 -0.502561 -0.636477 ... 2.027394 1.990361 2.021392 1.985726 1.983054 2.035105 2.064589 2.104258 2.169206 AZ
3 Arkansas 0.750236 -2.639977 -1.936388 0.806073 2.339822 0.862966 1.753084 -0.803799 -0.427731 ... 1.963481 1.935146 1.952224 1.903168 1.910383 1.938636 1.968906 1.992814 2.114963 AR
4 California 2.190976 -0.532920 -3.301403 -0.240642 2.572075 -0.458355 1.316666 0.655140 0.537513 ... 2.204615 2.139481 2.177632 2.163414 2.160885 2.212351 2.239585 2.270909 2.325179 CA

5 rows × 28 columns

In [36]:
pivot_population_data = pd.pivot_table(pop, values='Percent Change in Resident Population', index='State', columns='Year')

# Rename the columns to match the temperature data
new_columns = ['Avg_' + str(decade) + ' Pop Percent Change' for decade in pivot_population_data.columns]
pivot_population_data.columns = new_columns

# Merge the temperature data and the population data
merged_data = pd.merge(output_data2, pivot_population_data, on='State')
In [37]:
merged_data.head()
Out[37]:
State Avg_1900Percent Change Avg_1910Percent Change Avg_1920Percent Change Avg_1930Percent Change Avg_1940Percent Change Avg_1950Percent Change Avg_1960Percent Change Avg_1970Percent Change Avg_1980Percent Change ... Avg_1930 Pop Percent Change Avg_1940 Pop Percent Change Avg_1950 Pop Percent Change Avg_1960 Pop Percent Change Avg_1970 Pop Percent Change Avg_1980 Pop Percent Change Avg_1990 Pop Percent Change Avg_2000 Pop Percent Change Avg_2010 Pop Percent Change Avg_2020 Pop Percent Change
0 Alabama -0.872636 -2.127913 -0.963768 1.110638 1.512406 1.460351 1.509888 -1.512114 -0.015592 ... 12.7 7.1 8.1 6.7 5.4 13.1 3.8 10.1 7.5 5.1
1 Alaska 6.691152 14.107774 1.858280 -8.971326 -4.865070 -5.204364 4.168384 3.034927 1.893522 ... 7.7 22.3 77.4 75.8 32.8 33.8 36.9 14.0 13.3 3.3
2 Arizona 2.548125 -0.155804 -3.228203 -0.281522 1.585259 -0.270340 1.284533 -0.502561 -0.636477 ... 30.3 14.6 50.1 73.7 36.0 53.5 34.8 40.0 24.6 11.9
3 Arkansas 0.750236 -2.639977 -1.936388 0.806073 2.339822 0.862966 1.753084 -0.803799 -0.427731 ... 5.8 5.1 -2.0 -6.5 7.7 18.9 2.8 13.7 9.1 3.3
4 California 2.190976 -0.532920 -3.301403 -0.240642 2.572075 -0.458355 1.316666 0.655140 0.537513 ... 65.7 21.7 53.3 48.5 27.0 18.6 25.7 13.8 10.0 6.1

5 rows × 40 columns

In [38]:
for decade in range(1910, 2020, 10):
    fig = px.scatter(merged_data, x=f'Avg_{decade} Pop Percent Change', y=f'Avg_{decade}Percent Change', color='State')
    fig.update_layout(title=f'{decade}s', xaxis_title='Population Percent Change', yaxis_title='Temperature Percent Change')
    fig.show()

Here, let's check trends between the population data and the temperature data. I did not perform a linear regression here because I did not have enough variables in my population dataframe, so I decided to do a scatterplot instead. I will most likely conduct a proper linear regression for another economic variable.

Here I graphed the temperature percent change in the y axis and the population percent change in the x axis.

In 1910, there is a general upward trend, indicating a positive relationship between population and temperature in this decade. However, there are states that may not have had a large temperature change but whose population percent change is much larger. This initially makes sense, since this decade marks the industrial revolution, so the effect of increasing population would not be evident in the temperature yet, given that we are comparing temperature against the baseline years. There are also outliers, such as Alaska, which has a large temperature change but little to no population change. This makes sense since this plot does not take seasonality into account, and Alaska's temperature is highly variable. Therefore, Alaska will likely be an outlier in most graphs due to seasonality.

Overall, there is a general trend: the much hotter states such as California, Idaho, Florida, Nevada, and Arizona have a consistently higher population percent change over time, while their temperature change over time remains consistently closer to 0 than most other states. This overall trend is consistent with our seasonal findings (which you will see in the mapping section): the hotter states exhibit an overall LESS volatile change in temperature compared to the northern and northeastern, or 'cooler,' states. Geographically speaking, this trend means that for mid-southern states, population changes may not be the most accurate determinant of temperature change over time; the change may instead be attributed to the geographic location of the state. Again, I have to reiterate that correlation does not equal causation, so we can't make any causal inferences from this observation. This can merely be ONE of the many variables that affect temperature over time.

There is another very significant trend. As we can see, starting in the 1990s, for most states (again, Alaska being an outlier due to its geographical location), the temperature change veers more towards a positive increase, and fewer states have a mean temperature change that has DECREASED from the baseline years' mean. Visually, this can be seen in the 1990, 2000, and 2010 scatterplots, where the temperature change does not go into the negatives. This visual trend is interesting because it confirms something about our data, whether or not it is relevant to the population variable (which in itself is a good indicator of whether this is an economically significant variable in determining climate trends). We can see from these scatterplots that over time, the temperature change has gone from negative to positive, so there is definitely some warming across US states, which is especially visible from 1990 onwards.

Regarding the population change factor, we note some patterns that may be important, such as historically warmer states being less sensitive to population changes over time.

Overall, we really can't make a strong connection using a linear regression because we don't have enough variables for the population change over time. If we had, for example, monthly percent change data like we did for temperature, then we could have run those as x and y variables side by side. So this is a fairly limited conclusion, but one still worth pursuing.

PROJECT TWO¶

We will now map the data of the summary statistics tables that we have generated earlier. As helpful as the scatterplots are, it will be slightly easier to visualize regional changes through a map.

2.1 THE MESSAGE¶

The message or purpose of my data is to answer my research question, stated above. That is, I want to find patterns in temperature changes over time for the US states and find economic variables that may be good determinants of these temperature changes over certain regions. I also want to determine whether economic variables can be more or less significant indicators compared to the geographic location of any given state in the US. I think this can be economically significant, since this research can aid policymakers in taking climate into account when implementing economic policies.

Recall the {merged_stats} dataset in the Summary Statistics section, which gave us the average temperature and standard deviation of temperature per state, overall. The first visualization I want to create is a map to add meaning to my "message". Using colour to indicate which states have a standard deviation higher than their mean temperature in {merged_stats}, I can specifically focus on these states, as they may have more distinct temperature patterns. I want to do this to show, without taking into account seasonality or annual variability, how certain states may behave as a historical benchmark.

In [39]:
merged_stats['State Abbreviation'] = series
In [40]:
us_states = gpd.read_file('cb_2018_us_state_500k.shp')
merged_df = us_states.merge(merged_stats, left_on='STUSPS', right_on='State Abbreviation')
subset_df = merged_df[merged_df['std'] > merged_df['mean_temp']]

fig, ax = plt.subplots(1, figsize=(60, 40), subplot_kw={'aspect': 'equal'})
#ax.axis('off')
ax.set_xlim([-130, -65])
ax.set_ylim([23, 50])
ax.set_title('US States - STD > Mean Temp', fontdict={'fontsize': '25', 'fontweight' : '3'})

merged_df.plot(column='std', cmap='OrRd', linewidth=0.8, ax=ax, edgecolor='0.8', 
               legend=True, legend_kwds={'label': "Standard Deviation (degrees Celsius)", 'shrink': 0.3})

subset_df.plot(column='std', cmap='OrRd', linewidth=0.8, ax=ax, edgecolor='0.8',
               alpha=0.7, legend=False)

# Set color for all other states
other_df = merged_df[~merged_df.isin(subset_df)].dropna()
other_df.plot(color='white', linewidth=0.8, ax=ax, edgecolor='0.8')

plt.show()
In [41]:
fig2, ax2 = plt.subplots(1, figsize=(10, 20), subplot_kw={'aspect': 'equal'})
#ax.axis('off')
ax2.set_xlim([-180, -130])
ax2.set_ylim([50, 72])
ax2.set_title('US States - STD > Mean Temp', fontdict={'fontsize': '25', 'fontweight' : '3'})

merged_df.plot(column='std', cmap='OrRd', linewidth=0.8, ax=ax2, edgecolor='0.8', 
               legend=True, legend_kwds={'label': "Standard Deviation (degrees Celsius)", 'shrink': 0.3})

subset_df.plot(column='std', cmap='OrRd', linewidth=0.8, ax=ax2, edgecolor='0.8',
               alpha=0.7, legend=False)

# Set color for all other states
other_df = merged_df[~merged_df.isin(subset_df)].dropna()
other_df.plot(color='white', linewidth=0.8, ax=ax2, edgecolor='0.8')

plt.show()
In [42]:
# plot the overall standard devation in the states overtime - benchmark. 
fig, ax = plt.subplots(1, figsize=(80, 60), subplot_kw={'aspect': 'equal'})
ax.set_xlim([-130, -65])
ax.set_ylim([23, 50])

# Plot the map
merged_df.plot(column='std', cmap='OrRd', linewidth=0.8, ax=ax, edgecolor='0.8', 
               legend=True, legend_kwds={'label': "Standard Deviation (degrees Celsius)", 'shrink': 0.5})

# Create a second map that includes Alaska
fig2, ax2 = plt.subplots(1, figsize=(20, 20), subplot_kw={'aspect': 'equal'})
ax2.set_xlim([-180, -130])
ax2.set_ylim([50, 72])

# Plot the map with Alaska
merged_df.plot(column='std', cmap='OrRd', linewidth=0.8, ax=ax2, edgecolor='0.8', 
               legend=True, legend_kwds={'label': "Standard Deviation (degrees Celsius)", 'shrink': 0.5})

# Add Alaska to the second map
alaska_df = merged_df[merged_df['STUSPS'] == 'AK']
alaska_df.plot(column='std', cmap='OrRd', linewidth=0.8, ax=ax2, edgecolor='0.8', alpha=0.7, legend=False)

plt.show()
In [43]:
# plot the overall temperature in the states overtime - benchmark. 
fig, ax = plt.subplots(1, figsize=(80, 60), subplot_kw={'aspect': 'equal'})
ax.set_xlim([-130, -65])
ax.set_ylim([23, 50])

# Plot the map
merged_df.plot(column='mean_temp', cmap='OrRd', linewidth=0.8, ax=ax, edgecolor='0.8', 
               legend=True, legend_kwds={'label': "Average Temperature (degrees Celsius)", 'shrink': 0.5})

# Create a second map that includes Alaska
fig2, ax2 = plt.subplots(1, figsize=(20, 20), subplot_kw={'aspect': 'equal'})
ax2.set_xlim([-180, -130])
ax2.set_ylim([50, 72])

# Plot the map with Alaska
merged_df.plot(column='mean_temp', cmap='OrRd', linewidth=0.8, ax=ax2, edgecolor='0.8', 
               legend=True, legend_kwds={'label': "Average Temperature (degrees Celsius)", 'shrink': 0.5})

# Add Alaska to the second map
alaska_df = merged_df[merged_df['STUSPS'] == 'AK']
alaska_df.plot(column='mean_temp', cmap='OrRd', linewidth=0.8, ax=ax2, edgecolor='0.8', alpha=0.7, legend=False)

plt.show()
In [44]:
print('States whose temperature variability (std. dev.) exceeds their mean temperature over time:')
for state in subset_df['State'].values:
    print(state)
States whose temperature variability (std. dev.) exceeds their mean temperature over time:
Michigan
Massachusetts
Idaho
Nebraska
South Dakota
Colorado
Utah
Wyoming
New York
Alaska
Vermont
Montana
Iowa
New Hampshire
Maine
Wisconsin
North Dakota
Minnesota

The map depicts all the states whose standard deviation, on average (excluding seasonality trends), is larger than the state's mean temperature over time, using averages from 1900-2013. We can use this map to roughly estimate that these states (listed above) may be the most sensitive in terms of temperature change and variability, given their geographic locations.

However, this is just a rough estimate, since we do not take seasonality into account. Also, using the mean_temp and standard deviation maps, we can see that colder states have, on average, a higher standard deviation of temperature. We will look out for the same patterns when mapping in the next section.

We also note from our previous graph analysis that most of these states had more temperature variability in any given month. From this we can infer that northern states do indeed have more volatile changes in temperature over time compared to the baseline temperatures.

2.2 Maps and Interpretations¶

Seasonal Trends:¶

In [45]:
# here we will use our <output_data> dataframe that we created for seasonal trend. 
output_data.head()
Out[45]:
State Avg_FallPercent Change Avg_SpringPercent Change Avg_SummerPercent Change Avg_WinterPercent Change Std_FallPercent Change Std_SpringPercent Change Std_SummerPercent Change Std_WinterPercent Change State Abbreviation
0 Alabama 0.524667 1.207958 0.937016 -0.630510 27.929518 22.554549 4.194436 28.901118 AL
1 Alaska -1.654958 -3.666642 2.002998 -1.936255 160.516877 124.157545 13.571873 20.131822 AK
2 Arizona 1.308517 1.192448 1.130993 3.066347 33.406190 26.873746 6.067800 33.178413 AZ
3 Arkansas 0.393027 1.642405 0.990567 0.658951 32.052797 27.365292 5.894152 46.013985 AR
4 California 1.900165 0.985690 1.343946 3.938808 29.331947 23.579841 8.125712 25.076841 CA
In [46]:
for season in ['Fall', 'Winter', 'Summer', 'Spring']:
    fig = px.choropleth(output_data, locations='State Abbreviation', locationmode="USA-states",
                        color=f'Avg_{season}Percent Change', scope="usa",
                        color_continuous_scale="RdBu",range_color=(-5, 5),
                        title=f'Percent Change in Temperature for {season}')
    fig.update_layout(geo=dict(bgcolor= 'rgba(0,0,0,0)'))
    fig.show()

As the maps show, the temperature change from the baseline years through 2013 is most visible in the winter season, which is consistent with our earlier regression findings. It is even more interesting that the northern states show a temperature change below their winter baseline average: compared to each state's historical winter average up to 1950, winter temperatures have, on average, decreased more. Some of these states are also the ones we mapped earlier with a standard deviation that is large relative to their mean. Since that map did not consider seasonality, variability in the seasonal changes may explain why those states' standard deviations were high in the previous map. That is one potential explanation.

The temperature change in spring is also quite different. The northern states again exhibit a larger change: compared to the baseline years' average spring temperature, average temperatures have increased more in these states. These states are cooler on average and lie closer to cold bodies of water (the Great Lakes and the Atlantic Ocean), which may cool and warm sharply during spring due to the drastic shift in temperature. In these northern states, the transition from cold to hot seasons can take longer, so the shift in temperature of both the water bodies and the states may be more variable, and the maps reflect that.

In [47]:
for season in ['Fall', 'Winter', 'Summer', 'Spring']:
    fig = px.choropleth(output_data, locations='State Abbreviation', locationmode="USA-states",
                        color=f'Std_{season}Percent Change', scope="usa",
                        color_continuous_scale="RdBu",
                        title=f'Percent Change in Standard Deviation of Temperature for {season}')
    fig.update_layout(geo=dict(bgcolor= 'rgba(0,0,0,0)'))
    fig.show()

The standard deviations per state and season differ greatly over time. During winter, all states show larger temperature variability than in any other season. In the summer, fall, and spring months, the northern states again show the most variability.

These results indicate that geographic location plays a large role in how temperature behaves over time. Warmer southern states may experience less temperature variability in any given season, but they also have systematically higher temperatures; northern states have lower temperatures but much more variation within a season. While economic indicators such as CO2 emissions or precipitation may aid our understanding of these temperature changes, it is important to keep these geographic factors in view as well.

Aside: I am going to map the AverageTemperatureUncertainty for each state. I want to check whether my findings so far rest on accurate historical temperature measurements. AverageTemperatureUncertainty measures the accuracy of the temperature measurements over time, and I want to verify that measurement accuracy has indeed increased, due to better technology, greater information availability, and growing climate-change awareness over time.¶
In [48]:
temp_df = pd.read_csv("GlobalLandTemperaturesByState.csv", parse_dates=["dt"])

# Filter to mainland US states
mainland_states = ["Alabama", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"]
mainland_df = temp_df[temp_df.State.isin(mainland_states)].reset_index(drop=True)

# Calculate the baseline-period average measurement uncertainty for each state
baseline_start = "1900-01-01"
baseline_end = "1951-01-01"
baseline_mask = (mainland_df.dt >= baseline_start) & (mainland_df.dt <= baseline_end)
baseline_unc = mainland_df[baseline_mask].groupby("State")["AverageTemperatureUncertainty"].mean().reset_index()
baseline_unc.columns = ["State", "BaselineUncertainty"]

# Merge the baseline uncertainty into the temperature data
mainland_df = pd.merge(mainland_df, baseline_unc, on="State")

# Calculate the change in measurement uncertainty from the baseline
# (kept under the name "TempChange" since later cells refer to this column)
mainland_df["TempChange"] = mainland_df["AverageTemperatureUncertainty"] - mainland_df["BaselineUncertainty"]

# Group by state and calculate the mean uncertainty change per year
mainland_grouped = mainland_df.groupby(["State", pd.Grouper(key="dt", freq="Y")])["TempChange"].mean().reset_index()
mainland_grouped["Year"] = mainland_grouped["dt"].dt.year

# Pivot the data to have states as rows and years as columns
mainland_pivot = mainland_grouped.pivot(index="State", columns="Year", values="TempChange")

# Load the state shapefile
us_states_df = gpd.read_file("http://www2.census.gov/geo/tiger/GENZ2016/shp/cb_2016_us_state_5m.zip")
us_states_df = us_states_df.rename(columns={"NAME": "State"})

# Merge the temperature data with the state shapefile
merged = us_states_df.merge(mainland_pivot, on="State")

# Create the map
fig, ax = plt.subplots(figsize=(20, 15))

merged.plot(column=2013, cmap="coolwarm", linewidth=0.8, ax=ax, edgecolor="0.8")

ax.set_title("Mainland US Average Temperature Uncertainty Change 1900-2013", fontdict={"fontsize": "16", "fontweight" : "bold"})
ax.set_axis_off()

# Create the legend
sm = plt.cm.ScalarMappable(cmap="coolwarm", norm=plt.Normalize(vmin=merged[2013].min(), vmax=merged[2013].max()))
sm._A = []
cbar = fig.colorbar(sm, ax=ax)  # attach the colorbar to the map axes

plt.show()

Bluer colored states represent a lower increase in the average temperature uncertainty from the baseline period to the year specified in the merged.plot() function. This means that the range of possible temperature values for these states became smaller over time, indicating more reliable temperature measurements.

However, it is important to note that the range of possible temperature values is just one aspect of uncertainty in temperature measurements, and it does not necessarily mean that the actual temperature change was smaller in these states. There could be other sources of uncertainty in the temperature measurements that are not captured by this analysis.

As we can see, a larger cluster of states in the Midwest has higher temperature-measurement uncertainty than some of the northeastern states, indicating that the northeastern states have more reliable temperature measurements.

One possible explanation is that the Midwestern states have a relatively stable climate, with less variation in temperature over time than other regions, which could lead to less uncertainty in temperature measurements; however, this differs from the map generated above. Also, just because these states' uncertainty change has increased more over time does not mean that the temperature itself is unstable. It could simply mean that their temperature measurements improved from the baseline years toward more accurate readings. Additionally, the Midwestern states may have had more consistent and reliable temperature-monitoring stations over time, leading to more consistent measurements and less uncertainty.

On the other hand, the Northeastern region has more complex topography, with mountains and coastlines that can influence local weather patterns, and may have had more variation in temperature over time, leading to higher uncertainty in temperature measurements. Additionally, the Northeastern region is more densely populated and urbanized, which can create heat islands and other local climate effects that can further complicate temperature measurements.

It is important to note that these are just potential explanations, and other factors could also contribute to the observed patterns. This map shows changes in measurement uncertainty, not in temperature itself, so it is important to keep that in mind. My reason for including this map is to show that temperature monitoring has drastically improved over time, even in states that may previously have been less populated. This matters because we need to know how certain the temperature measurements underlying our findings are.

Population and Temperature Mapping Analysis:¶

In [49]:
# let's map <output_data2> and <merged_data> to show the temperature change and population change.
for year in range(1910, 2020, 10):
    fig = px.choropleth(merged_data, locations='State Abbreviation', locationmode="USA-states",
                        color=f'Avg_{year}Percent Change', scope="usa",
                        color_continuous_scale="Reds",
                        title=f'Average of Temperature Percent Change for Decade: {year}')
    #fig.update_layout(geo=dict(bgcolor= 'rgba(0,0,0,0)'))
    fig.show()
# I used a red scale because I wanted to show the dramatic increase in temperature change over time.

As we can see, the percent change in temperature from the baseline years increases decade by decade, especially over the last three decades, when the mapping scale moves into the double digits for most states; in earlier decades, the percent change moved substantially for only a few states. The magnitude of the changes has grown, and the dramatic colour shift makes this visible, with a particularly large increase in the 1980-1990 decade. We can also see that the northern states experience more of a positive temperature change in each decade. Climate change has been much more at the forefront since 1990 than in the decades before.

Some states may be exceptions due to their geographic location. The northern states exhibit a larger change in each decade because 1) they are cooler on average and may be more variable since the dataframe does not account for seasonality (Alaska, the Dakotas, Michigan), and 2) they experience more changes in precipitation due to their proximity to the ocean, which could also affect the temperature analysis (Maine, Vermont, New Hampshire, etc.). These states are also more variable on average; we saw this in the earlier standard-deviation graphs, so this analysis is consistent with that one, and with our population scatter analysis, which found that warmer states are less volatile and less sensitive to population changes. To confirm this, we can also take our original state data, group it by year, and plot it.
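The "group by year and plot" check mentioned above can be sketched as follows. The dataframe here is a toy stand-in for the original state-level data (the column names are assumed to match the notebook's):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headlessly
import matplotlib.pyplot as plt

# Toy stand-in for the state-level temperature data.
df = pd.DataFrame({
    "year": [1970, 1970, 1971, 1971],
    "AverageTemperature": [8.0, 10.0, 8.5, 10.5],
})

# Average across states within each year, then plot the annual series.
annual = df.groupby("year")["AverageTemperature"].mean()
annual.plot(title="Average temperature by year")
plt.close("all")
```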

PROJECT THREE¶

3.1 Potential Data to Scrape¶

Potential data I would like to scrape is monthly CO2 emissions per state, or the percent change in industrialization per state over time. If the dataset were large and dated back far enough, it would be very meaningful for my research question. Since both are important factors in determining temperature over time, a large dataset would support regressions that add more meaning to my question. Monthly data would be more useful than annual aggregates because my temperature dataset has monthly frequency. It might be difficult, though, to find a dataset that perfectly complements my research question: some US datasets I found are useful but not monthly, and some lack state-level data entirely, offering only country-level figures.

Here is a potential source that complements my research: the US Energy Information Administration (EIA). https://www.eia.gov/environment/emissions/state/ provides energy-related CO2 emissions by state from 1970 to 2020. This is the best dataset I could find that fits my research question and can be web scraped: the site offers API-key access and an API URL, so I can scrape the data programmatically.

I can create a dataframe from this dataset and merge it with my annually aggregated data, so both have the same frequency. I can then compare each state's CO2 emissions with its temperature, how both change over time, and how temperature might reflect changes in CO2 emissions. I hope to find that the two are positively correlated over time. This matters for my research question because I can then discover whether CO2 emissions affect a state's temperature over time more than geographic location or other economic variables do.

Alternatively, I could use the two datasets separately to conduct parallel analyses and then compare the results. For instance, I could analyze the temperature data to look for trends or anomalies and then see if these correspond to any patterns in the carbon dioxide emissions data.

Overall, combining the temperature and carbon dioxide emissions datasets could help me to gain a more comprehensive understanding of the environmental factors that influence carbon dioxide emissions and climate change.

3.2 Potential Challenges¶

The data source above is one of only a few I found that reports CO2 emissions by state and dates back far enough to form a long dataset. I chose it, even though it lacks monthly data, because it supports API web scraping with an API key. Other websites with similar data either disallow scraping or offer smaller datasets. Scraping this dataset also takes very little time, which is convenient for a project with time limitations.

Potential Challenges that I face with scraping these sources are:

  • The data is not monthly, so an outer merge with my current dataset will be less meaningful than if I had monthly observations to track over time. But every dataset I found has trade-offs: some have longer time frames but no state data, some have state data but a shorter time frame, and for most of them web scraping is tricky. This is the best dataset I found under these circumstances.

  • I need to be careful with the CO2 emissions values: to make a meaningful comparison against temperature, I may need the change in CO2 emissions from one year to the next. I would likely group the data by state, then by fuel_name, compute the change per year for each fuel type, and then see whether certain fuel emissions are better predictors of temperature. This can be technically challenging to think through and code up, especially for running regressions.

  • The EIA's API does not allow me to scrape more than 5000 rows, so I will have to use the 'offset' parameter to specify the starting row for the next API request. I can start the next request at row 5000 (the maximum number of rows returned by the API in a single request) and then increment the offset by 5000 for each subsequent request until I have all the data. I will most likely create a while loop for this part and keep iterating until I don't get anything from the API.
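The grouping step described in the second bullet can be sketched like this, on toy rows shaped like the scraped data (the column names mirror the API fields; the numbers are made up):

```python
import pandas as pd

# Toy rows shaped like the scraped emissions data.
df = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Ohio"],
    "fuel_name": ["Coal", "Coal", "Petroleum", "Petroleum"],
    "year": [1970, 1971, 1970, 1971],
    "value": [147.9, 150.0, 71.3, 70.0],
})

# Year-over-year change in emissions within each (state, fuel) group.
df = df.sort_values(["state", "fuel_name", "year"])
df["value_change"] = df.groupby(["state", "fuel_name"])["value"].diff()
print(df[["fuel_name", "year", "value_change"]])
```

Each group's first year gets NaN, and later years get the change from the previous year, which could then be regressed against temperature.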

Let's scrape!

3.3 Scraping Data from a Website¶

Here, we will scrape the data using API-web scraping, using the EIA's API personal access key. Here is the API link that I used to get the data: https://www.eia.gov/opendata/index.php

In [ ]:
import requests
import json
import pandas as pd

offset = 0
# we will first set offset to 0, indicating we want the first 5000 rows, then keep updating it in each loop 
# iteration and adding it in our dataframe to get the next 5000 rows -> [0, 5000] -> [5000, 10,000] etc. 

rows = []
# rows is an empty list at first 

while True:
    # Here is the API URL. I signed up for an access key and inserted it into
    # the url in the required format. Near the end of the url, an f-string
    # updates the offset parameter each iteration so the API knows which rows to return.
    
    url = f'https://api.eia.gov/v2/co2-emissions/co2-emissions-aggregates/data/?api_key=qo4UGF5ygDuaa28f53pLo4QD6v2rgPGOMMJZ6mc5&frequency=annual&data[0]=value&facets[stateId][]=AK&facets[stateId][]=AL&facets[stateId][]=AR&facets[stateId][]=AZ&facets[stateId][]=CA&facets[stateId][]=CO&facets[stateId][]=CT&facets[stateId][]=DC&facets[stateId][]=DE&facets[stateId][]=FL&facets[stateId][]=GA&facets[stateId][]=HI&facets[stateId][]=IA&facets[stateId][]=ID&facets[stateId][]=IL&facets[stateId][]=IN&facets[stateId][]=KS&facets[stateId][]=KY&facets[stateId][]=LA&facets[stateId][]=MA&facets[stateId][]=MD&facets[stateId][]=ME&facets[stateId][]=MI&facets[stateId][]=MN&facets[stateId][]=MO&facets[stateId][]=MS&facets[stateId][]=MT&facets[stateId][]=NC&facets[stateId][]=ND&facets[stateId][]=NE&facets[stateId][]=NH&facets[stateId][]=NJ&facets[stateId][]=NM&facets[stateId][]=NV&facets[stateId][]=NY&facets[stateId][]=OH&facets[stateId][]=OK&facets[stateId][]=OR&facets[stateId][]=PA&facets[stateId][]=RI&facets[stateId][]=SC&facets[stateId][]=SD&facets[stateId][]=TN&facets[stateId][]=TX&facets[stateId][]=US&facets[stateId][]=UT&facets[stateId][]=VA&facets[stateId][]=VT&facets[stateId][]=WA&facets[stateId][]=WI&facets[stateId][]=WV&facets[stateId][]=WY&start=1970&end=2020&sort[0][column]=period&sort[0][direction]=asc&offset={offset}&length=5000'
    
    response = requests.get(url)
    
    print(response.status_code)
    # check the response for each iteration: a 200 status code means the
    # request succeeded and the data was fetched.
    
    # parse the response text into a dictionary using the json library.
    api_response = json.loads(response.text)

    # Extract the relevant data from the dictionary
    
    data = api_response['response']['data']
    
    # if there is no more data, we are done, so end the loop.
    if not data:
        break

    # Loop through the data and extract the required columns for each row, where each item is a small dict object. 
    for item in data:
        row = {'period': item['period'],
               'sector_name': item['sector-name'],
               'fuel_name': item['fuel-name'],
               'state': item['state-name'],
               'State Abbreviation': item['stateId'],
               'value': item['value'],
               'value_units': item['value-units']}

        # Add the row to the list of rows if it's not a duplicate (as a precaution)
        if row not in rows:
            rows.append(row)

    # Increment the offset by 5000 for the next API request
    offset += 5000

api_df = pd.DataFrame(rows)
In [152]:
api_df
Out[152]:
period sector_name fuel_name state State Abbreviation value value_units
0 1970 Residential carbon dioxide emissions Coal Wisconsin WI 1.464746 million metric tons of CO2
1 1970 Total carbon dioxide emissions from all sectors Natural Gas Ohio OH 56.212733 million metric tons of CO2
2 1970 Total carbon dioxide emissions from all sectors Petroleum Ohio OH 71.253559 million metric tons of CO2
3 1970 Total carbon dioxide emissions from all sectors Coal Ohio OH 147.878396 million metric tons of CO2
4 1970 Total carbon dioxide emissions from all sectors All Fuels Ohio OH 275.344688 million metric tons of CO2
... ... ... ... ... ... ... ...
63031 2020 Industrial carbon dioxide emissions All Fuels North Dakota ND 16.016110 million metric tons of CO2
63032 2020 Total carbon dioxide emissions from all sectors All Fuels North Dakota ND 54.251280 million metric tons of CO2
63033 2020 Total carbon dioxide emissions from all sectors Coal North Dakota ND 34.793557 million metric tons of CO2
63034 2020 Total carbon dioxide emissions from all sectors Petroleum North Dakota ND 11.400245 million metric tons of CO2
63035 2020 Total carbon dioxide emissions from all sectors Natural Gas North Dakota ND 8.057478 million metric tons of CO2

63036 rows × 7 columns

Now we have our dataset that we have successfully scraped into our project.

This data provides information on carbon dioxide emissions for each state in the United States from 1970 to 2020. The dataset includes information on the amount of emissions produced by various sectors, such as residential and transportation, and from different types of fuels, such as coal and natural gas. The emissions are measured in million metric tons of CO2, and the dataset also includes information on the units of measurement for each value. This data can be used to analyze trends in carbon dioxide emissions over time and to compare emissions across different states and sectors.
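As a quick sketch of how this dataset can be used, one could pull the all-sector, all-fuel rows to compare states' overall emissions in a year. The dataframe below is a toy stand-in for `api_df` with the same columns:

```python
import pandas as pd

# Toy stand-in for a few rows of the scraped api_df.
api_df = pd.DataFrame({
    "period": [1970, 1970, 1970],
    "sector_name": ["Total carbon dioxide emissions from all sectors",
                    "Total carbon dioxide emissions from all sectors",
                    "Residential carbon dioxide emissions"],
    "fuel_name": ["All Fuels", "All Fuels", "Coal"],
    "state": ["Ohio", "Indiana", "Ohio"],
    "value": [275.34, 171.93, 1.99],
})

# Keep only the all-sector, all-fuel totals, then rank states by emissions.
totals = api_df[(api_df["sector_name"] == "Total carbon dioxide emissions from all sectors")
                & (api_df["fuel_name"] == "All Fuels")]
ranked = totals.sort_values("value", ascending=False)
print(ranked[["state", "value"]])
```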

3.4 Merging the Scraped Data¶

Since this data is annual, I will use one of the datasets I created earlier, since I aggregated the data annually for that section.

In [155]:
api_df
api_df.rename(columns={'period': 'year'}, inplace=True)
api_df.rename(columns={'state': 'State'}, inplace=True)
In [168]:
api_df.head(30) # this is what the dataframe looks like. 
Out[168]:
year sector_name fuel_name State State Abbreviation value value_units
0 1970 Residential carbon dioxide emissions Coal Wisconsin WI 1.464746 million metric tons of CO2
1 1970 Total carbon dioxide emissions from all sectors Natural Gas Ohio OH 56.212733 million metric tons of CO2
2 1970 Total carbon dioxide emissions from all sectors Petroleum Ohio OH 71.253559 million metric tons of CO2
3 1970 Total carbon dioxide emissions from all sectors Coal Ohio OH 147.878396 million metric tons of CO2
4 1970 Total carbon dioxide emissions from all sectors All Fuels Ohio OH 275.344688 million metric tons of CO2
5 1970 Industrial carbon dioxide emissions All Fuels Ohio OH 104.045393 million metric tons of CO2
6 1970 Industrial carbon dioxide emissions Natural Gas Ohio OH 19.505874 million metric tons of CO2
7 1970 Industrial carbon dioxide emissions Petroleum Ohio OH 15.917211 million metric tons of CO2
8 1970 Industrial carbon dioxide emissions Coal Ohio OH 68.622309 million metric tons of CO2
9 1970 Electric Power carbon dioxide emissions All Fuels Ohio OH 77.417981 million metric tons of CO2
10 1970 Electric Power carbon dioxide emissions Natural Gas Ohio OH 1.159653 million metric tons of CO2
11 1970 Electric Power carbon dioxide emissions Petroleum Ohio OH 0.669479 million metric tons of CO2
12 1970 Electric Power carbon dioxide emissions Coal Ohio OH 75.588849 million metric tons of CO2
13 1970 Transportation carbon dioxide emissions All Fuels Ohio OH 47.665019 million metric tons of CO2
14 1970 Transportation carbon dioxide emissions Natural Gas Ohio OH 0.650449 million metric tons of CO2
15 1970 Transportation carbon dioxide emissions Petroleum Ohio OH 46.909565 million metric tons of CO2
16 1970 Transportation carbon dioxide emissions Coal Ohio OH 0.105005 million metric tons of CO2
17 1970 Commercial carbon dioxide emissions All Fuels Ohio OH 13.094867 million metric tons of CO2
18 1970 Commercial carbon dioxide emissions Natural Gas Ohio OH 9.948143 million metric tons of CO2
19 1970 Commercial carbon dioxide emissions Petroleum Ohio OH 1.579323 million metric tons of CO2
20 1970 Commercial carbon dioxide emissions Coal Ohio OH 1.567402 million metric tons of CO2
21 1970 Residential carbon dioxide emissions All Fuels Ohio OH 33.121427 million metric tons of CO2
22 1970 Residential carbon dioxide emissions Natural Gas Ohio OH 24.948614 million metric tons of CO2
23 1970 Residential carbon dioxide emissions Petroleum Ohio OH 6.177982 million metric tons of CO2
24 1970 Residential carbon dioxide emissions Coal Ohio OH 1.994831 million metric tons of CO2
25 1970 Total carbon dioxide emissions from all sectors Natural Gas Indiana IN 28.458854 million metric tons of CO2
26 1970 Total carbon dioxide emissions from all sectors Petroleum Indiana IN 49.185580 million metric tons of CO2
27 1970 Total carbon dioxide emissions from all sectors Coal Indiana IN 94.287769 million metric tons of CO2
28 1970 Total carbon dioxide emissions from all sectors All Fuels Indiana IN 171.932203 million metric tons of CO2
29 1970 Industrial carbon dioxide emissions All Fuels Indiana IN 73.877176 million metric tons of CO2
In [163]:
state = state.drop(columns='Year')
In [ ]:
state_data = state.groupby(['State', 'year'])[['AverageTemperature']].mean().reset_index()
state_data
In [165]:
state_df = state_data.loc[(state_data['year'] >= 1970) & (state_data['year'] <= 2013)]


# merge the two dataframes on the common column 'State'
merged_df = pd.merge(api_df, state_df, on=['year', 'State'])

merged_df
Out[165]:
year sector_name fuel_name State State Abbreviation value value_units AverageTemperature
0 1970 Residential carbon dioxide emissions Coal Wisconsin WI 1.464746 million metric tons of CO2 6.248417
1 1970 Total carbon dioxide emissions from all sectors Natural Gas Wisconsin WI 17.916791 million metric tons of CO2 6.248417
2 1970 Total carbon dioxide emissions from all sectors Petroleum Wisconsin WI 34.044656 million metric tons of CO2 6.248417
3 1970 Total carbon dioxide emissions from all sectors Coal Wisconsin WI 36.224345 million metric tons of CO2 6.248417
4 1970 Total carbon dioxide emissions from all sectors All Fuels Wisconsin WI 88.185792 million metric tons of CO2 6.248417
... ... ... ... ... ... ... ... ...
51327 2013 Commercial carbon dioxide emissions Coal Wisconsin WI 0.082526 million metric tons of CO2 8.085333
51328 2013 Residential carbon dioxide emissions All Fuels Wisconsin WI 9.756511 million metric tons of CO2 8.085333
51329 2013 Residential carbon dioxide emissions Natural Gas Wisconsin WI 7.787419 million metric tons of CO2 8.085333
51330 2013 Residential carbon dioxide emissions Petroleum Wisconsin WI 1.969093 million metric tons of CO2 8.085333
51331 2013 Residential carbon dioxide emissions Coal Wisconsin WI 0.000000 million metric tons of CO2 8.085333

51332 rows × 8 columns

Ok, we have successfully merged the {state} dataset with the {api_df} dataset. As we can see, we have 51,332 rows and 8 columns. When I group the data by state (which I need to do to extract relevant information for my research question), there are around 1056 observations per state. This number seems suitable for running regressions, since there are enough observations for state-level and regional analysis, where I can group certain states together and analyze them against emissions. Grouping by region instead of state yields even more observations.
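The per-state observation count mentioned above can be verified with `value_counts()`. Here is a minimal sketch on toy rows (in the notebook this would be `merged_df['State'].value_counts()`, giving roughly 1056 rows per state):

```python
import pandas as pd

# Toy stand-in for the merged dataframe's State column.
merged_df = pd.DataFrame({"State": ["Ohio", "Ohio", "Ohio", "Indiana", "Indiana"]})

# Rows available per state, e.g. to judge whether regressions have enough observations.
counts = merged_df["State"].value_counts()
print(counts.to_dict())
```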

3.5 Visualizing the Scraped Data¶

In [201]:
visual = merged_df.groupby('sector_name')
In [205]:
for group in visual.groups:
    thing = visual.get_group(group)
    plt.style.use('seaborn')
    thing.plot(x='year', y='value', kind='scatter')
    plt.title(f"{group}")
    plt.show()

Since I want to run regressions, it is important to note how each sector's CO2 emissions behave over time. I want to see whether I can fit linear regression models against temperature to capture a linear trend. As we can see, there are many spikes in these charts, so some states may use much more energy in a given sector. More generally, these distributions can be skewed, but most show a near-linear trend. Since this dataset is smaller, it may be hard to detect nonlinear patterns. Let's move on to some regression results.

Regressions for CO2 emissions dataset and temperature dataset¶

In [211]:
merged_df['Residential carbon dioxide emissions'] = merged_df['sector_name'].apply(lambda x: 1 if x == 'Residential carbon dioxide emissions' else 0) 
merged_df['Industrial carbon dioxide emissions'] = merged_df['sector_name'].apply(lambda x: 1 if x == 'Industrial carbon dioxide emissions' else 0) 
merged_df['Electric Power carbon dioxide emissions'] = merged_df['sector_name'].apply(lambda x: 1 if x == 'Electric Power carbon dioxide emissions' else 0)
merged_df['Transportation carbon dioxide emissions'] = merged_df['sector_name'].apply(lambda x: 1 if x == 'Transportation carbon dioxide emissions' else 0)
merged_df['Commercial carbon dioxide emissions'] = merged_df['sector_name'].apply(lambda x: 1 if x == 'Commercial carbon dioxide emissions' else 0)
merged_df['Total carbon dioxide emissions from all sectors'] = merged_df['sector_name'].apply(lambda x: 1 if x == 'Total carbon dioxide emissions from all sectors' else 0)
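For reference, the six `apply` calls above can be produced in one step with `pd.get_dummies`; a minimal sketch on a toy `sector_name` column:

```python
import pandas as pd

# Toy sector_name column.
df = pd.DataFrame({"sector_name": [
    "Residential carbon dioxide emissions",
    "Industrial carbon dioxide emissions",
    "Residential carbon dioxide emissions",
]})

# One 0/1 indicator column per sector, named after the sector itself.
dummies = pd.get_dummies(df["sector_name"], dtype=int)
df = df.join(dummies)
print(df["Residential carbon dioxide emissions"].tolist())
```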

We can't group by season here since the CO2 data has annual frequency. The first regression will be on a per-state basis, using the CO2 emission values along with the year variable to predict temperature, to see whether CO2 emissions can predict temperature in any given state.

Here is the model:

$$ {AverageTemperature}_i = \beta_0 + \beta_1 {year}_i + \beta_2 {CO2emissions}_i + u_i $$

where:

  • $ \beta_0 $ is the intercept of the linear trend line on the y-axis
  • $ \beta_1 $ is the coefficient on year: the average annual increase in temperature (in Celsius) in that state
  • $ \beta_{2} $ is the coefficient on the overall CO2 emission value in that state.
  • $ u_i $ is a random error term (deviations of observations from the linear trend due to factors not included in the model)
In [170]:
grouped_state = merged_df.groupby('State')
In [198]:
lst10 = []
for group in grouped_state.groups:
    lst10.append(group)
In [235]:
lst4 = []
for groups in grouped_state.groups:
    state3 = grouped_state.get_group(groups)
    X = state3[['year', 'value']]
    Y = state3[['AverageTemperature']]
 
    X = sm.add_constant(X)
    model = sm.OLS(Y, X).fit()
    lst4.append(model)
    predictions = model.predict(X) 
In [236]:
stargazer10 = Stargazer(lst4)
stargazer10.custom_columns(lst10, [1] * 49)  # one column per state model
stargazer10.show_model_numbers(False)
stargazer10.show_degrees_of_freedom(False)

HTML(stargazer10.render_html())
Out[236]:
Dependent variable: AverageTemperature

State (one model per column): Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | Hawaii | Idaho | Illinois | Indiana | Iowa | Kansas | Kentucky | Louisiana | Maine | Maryland | Massachusetts | Michigan | Minnesota | Mississippi | Missouri | Montana | Nebraska | Nevada | New Hampshire | New Jersey | New Mexico | New York | North Carolina | North Dakota | Ohio | Oklahoma | Oregon | Pennsylvania | Rhode Island | South Carolina | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming
const: -34.821*** | -87.222*** | -51.527*** | -43.380*** | -35.163*** | -55.716*** | -55.750*** | -45.238*** | -14.253*** | -2.018 | -52.841*** | -49.512*** | -53.708*** | -53.000*** | -48.601*** | -46.563*** | -36.916*** | -69.857*** | -43.091*** | -56.049*** | -67.498*** | -64.629*** | -37.491*** | -46.884*** | -58.270*** | -53.030*** | -45.688*** | -68.141*** | -50.541*** | -54.889*** | -65.809*** | -38.398*** | -56.357*** | -51.969*** | -45.914*** | -37.221*** | -53.651*** | -55.889*** | -30.463*** | -54.865*** | -42.925*** | -50.388*** | -52.641*** | -71.104*** | -41.848*** | -45.037*** | -45.646*** | -65.949*** | -60.487***
(SE): (2.319) | (4.658) | (2.193) | (2.652) | (2.227) | (2.554) | (2.735) | (2.541) | (2.048) | (1.637) | (3.261) | (3.280) | (3.222) | (4.006) | (3.220) | (2.784) | (2.160) | (3.292) | (2.541) | (2.990) | (3.538) | (4.467) | (2.287) | (3.257) | (4.258) | (3.815) | (2.749) | (3.136) | (2.652) | (2.156) | (3.083) | (2.118) | (5.055) | (3.158) | (2.791) | (2.612) | (2.835) | (2.726) | (2.306) | (4.605) | (2.546) | (2.324) | (2.886) | (3.161) | (2.404) | (2.877) | (2.632) | (3.903) | (3.348)
value: -0.000 | 0.002 | 0.000 | 0.000 | 0.000 | -0.000 | -0.000 | -0.001 | -0.000 | 0.001 | 0.002 | -0.000 | -0.000 | -0.001 | -0.001 | -0.000 | -0.000 | -0.004 | -0.000 | -0.000 | -0.000 | -0.001 | -0.000 | -0.000 | 0.001 | 0.000 | 0.000 | -0.005 | -0.000 | -0.000 | -0.000 | -0.000 | 0.000 | -0.000 | -0.000 | 0.000 | -0.000 | -0.001 | -0.000 | -0.005 | -0.000 | -0.000 | -0.000 | -0.006 | -0.000 | 0.000 | -0.000 | -0.000 | 0.000
(SE): (0.001) | (0.004) | (0.001) | (0.001) | (0.000) | (0.001) | (0.002) | (0.004) | (0.000) | (0.002) | (0.006) | (0.000) | (0.000) | (0.002) | (0.001) | (0.001) | (0.000) | (0.004) | (0.001) | (0.001) | (0.001) | (0.002) | (0.001) | (0.001) | (0.004) | (0.003) | (0.002) | (0.005) | (0.001) | (0.001) | (0.000) | (0.000) | (0.003) | (0.000) | (0.001) | (0.002) | (0.000) | (0.006) | (0.001) | (0.010) | (0.001) | (0.000) | (0.001) | (0.012) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001)
year: 0.026*** | 0.042*** | 0.034*** | 0.030*** | 0.025*** | 0.032*** | 0.033*** | 0.029*** | 0.018*** | 0.012*** | 0.029*** | 0.031*** | 0.033*** | 0.031*** | 0.031*** | 0.030*** | 0.028*** | 0.038*** | 0.028*** | 0.032*** | 0.038*** | 0.035*** | 0.028*** | 0.030*** | 0.032*** | 0.032*** | 0.028*** | 0.038*** | 0.031*** | 0.034*** | 0.037*** | 0.027*** | 0.031*** | 0.032*** | 0.031*** | 0.023*** | 0.032*** | 0.033*** | 0.024*** | 0.031*** | 0.029*** | 0.035*** | 0.031*** | 0.039*** | 0.028*** | 0.027*** | 0.029*** | 0.037*** | 0.033***
(SE): (0.001) | (0.002) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.002) | (0.002) | (0.002) | (0.002) | (0.002) | (0.001) | (0.001) | (0.002) | (0.001) | (0.002) | (0.002) | (0.002) | (0.001) | (0.002) | (0.002) | (0.002) | (0.001) | (0.002) | (0.001) | (0.001) | (0.002) | (0.001) | (0.003) | (0.002) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.002) | (0.001) | (0.001) | (0.001) | (0.002) | (0.001) | (0.001) | (0.001) | (0.002) | (0.002)
Observations1,0561,0561,0561,0561,0561,0561,0561,0561,0561,0091,0561,0121,0491,0291,0561,0561,0531,0191,0491,0101,0351,0021,0561,0561,0561,0561,0561,0561,0561,0561,0561,0081,0321,0331,0561,0561,0561,0561,0561,0561,0561,0561,0561,0561,0561,0321,0561,0561,056
R20.3280.2410.4820.3270.3230.3730.3540.3310.2330.1880.2370.2560.2820.1920.2570.3090.3940.3390.3170.3190.3030.1980.3610.2440.1810.2080.2890.3510.3420.4810.3550.3910.1300.2790.3190.2280.3240.3560.2950.1500.3270.4580.3090.3630.3360.2510.3100.2490.275
Adjusted R20.3260.2400.4810.3260.3220.3720.3530.3300.2320.1860.2360.2540.2810.1910.2560.3070.3930.3380.3160.3170.3010.1960.3600.2430.1790.2060.2870.3490.3410.4800.3540.3900.1290.2780.3180.2260.3220.3550.2930.1480.3260.4570.3070.3620.3340.2500.3090.2470.274
Residual Std. Error0.4790.9470.4460.5450.4610.5230.5660.5260.4200.3250.6740.6720.6620.8230.6660.5740.4470.6730.5260.5970.7300.9120.4710.6730.8720.7840.5610.6480.5490.4450.6370.4361.0130.6500.5740.5390.5870.5650.4730.9510.5270.4790.5880.6550.4960.5850.5450.8070.686
F Statistic256.500***167.310***490.094***256.296***251.236***313.280***289.108***260.368***160.024***116.337***163.696***173.183***205.546***122.182***182.215***235.057***341.894***260.712***242.720***235.364***224.077***123.213***297.755***170.029***116.209***137.930***213.560***284.382***273.757***488.764***289.556***323.130***77.034***199.495***246.954***155.369***251.955***291.294***219.813***92.842***255.715***444.361***235.117***300.432***266.043***172.519***236.543***174.178***199.654***
Note: *p<0.1; **p<0.05; ***p<0.01

As we can see here, none of the CO2 emission coefficients are statistically or economically significant predictors of temperature over time. This may be because we are aggregating all of the CO2 values together instead of picking and choosing which ones may be more important. However, I think the bigger picture may be that CO2 emissions became more prominent in later years, so regressing them against temperature across all years and all sectors may not be that meaningful. Instead, let's pick more important variables to group by.
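The significance claim above can also be checked numerically rather than read off the table. Below is a minimal numpy-only sketch (synthetic data, not the notebook's `merged_df`; the trend and noise levels are made-up assumptions) that fits the same two-regressor model and flags coefficients by t-statistic:

```python
import numpy as np

def ols_tstats(X, y):
    """Fit y = X b + u by least squares and return (b, t-statistics)."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sigma2 = resid @ resid / (n - k)                      # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, b / se

# Toy state: a warming trend on year plus an irrelevant "emissions" regressor
rng = np.random.default_rng(0)
n = 500
year = np.linspace(1970, 2013, n)
emissions = rng.normal(size=n)                            # unrelated to y by construction
y = -50 + 0.03 * year + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), year, emissions])

b, t = ols_tstats(X, y)
# |t| > ~1.96 corresponds roughly to 5% significance for large n;
# the year coefficient clears it easily, the noise regressor usually does not
print(b[1], t[1], t[2])
```

This mirrors what the stargazer stars encode: the year trend is strongly significant while a regressor that carries no signal about temperature is not.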

In this next model, we can isolate yearly values, since in the previous regression the only significant variable was the year variable. Therefore, there must be a correlation between the year variable and both CO2 emissions and temperature changes. In this next regression we can also control for the different types of sectors by creating dummy variables, to see which sectors may be more or less significant over time. Note that we don't just use the total-emissions sector in this model, since we want to find out which sector produces the most statistically significant results over time. In this model, the y variable is the average temperature in a given year from 1970-2013.

Here is this model:

$$ {AverageTemperature}_i = \beta_0 + \beta_1 {CO2emissions}_i + \beta_2 {ResidentialCO2}_i + \beta_3 {IndustrialCO2}_i + \beta_4 {ElectricCO2}_i + \beta_5 {TransportationCO2}_i + \beta_6 {CommercialCO2}_i + u_i $$

where:

  • $ \beta_0 $ is the intercept of the linear trend line on the y-axis
  • $ \beta_1 $ is the coefficient on total CO2 emissions for that given year, in million metric tons of CO2
  • $ \beta_{2...6} $ are the coefficients on the per-sector CO2 emission variables for that given year, in million metric tons of CO2
  • $ u_i $ is a random error term (deviations of observations from the linear trend due to factors not included in the model)
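The per-sector 0/1 indicator columns used in this model can be built one at a time with the `apply(lambda ...)` pattern shown earlier, or all at once with `pandas.get_dummies`. A minimal sketch on a toy frame (the sector labels mirror the dataset's, but the data here is made up for illustration):

```python
import pandas as pd

# Toy frame standing in for merged_df's sector_name column
df = pd.DataFrame({"sector_name": [
    "Residential carbon dioxide emissions",
    "Industrial carbon dioxide emissions",
    "Residential carbon dioxide emissions",
]})

# One 0/1 column per sector, equivalent to repeating the apply(lambda ...) pattern
dummies = pd.get_dummies(df["sector_name"]).astype(int)
df = pd.concat([df, dummies], axis=1)

print(df["Residential carbon dioxide emissions"].tolist())  # [1, 0, 1]
```

`get_dummies` guarantees every sector gets its own column with a consistent naming scheme, which is less error-prone than writing one lambda per sector.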
In [206]:
grouped_years = merged_df.groupby('year')
In [244]:
lst5 = []
for groups in grouped_years.groups:
    year_df = grouped_years.get_group(groups)
    X = year_df[['value', 'Residential carbon dioxide emissions', 'Industrial carbon dioxide emissions', 'Electric Power carbon dioxide emissions', 'Transportation carbon dioxide emissions', 'Commercial carbon dioxide emissions']]
    Y = year_df[['AverageTemperature']]
 
    X = sm.add_constant(X)
    model = sm.OLS(Y, X).fit()
    lst5.append(model)
    predictions = model.predict(X) 
In [249]:
years_lst = [str(year) for year in range(1970, 2014)]  # column labels
anotherlst = [1] * len(years_lst)                      # one column per yearly model
stargazer11 = Stargazer(lst5)
stargazer11.custom_columns(years_lst, anotherlst)
stargazer11.show_model_numbers(False)
stargazer11.show_degrees_of_freedom(False)

HTML(stargazer11.render_html())
Out[249]:
Dependent variable:AverageTemperature
19701971197219731974197519761977197819791980198119821983198419851986198719881989199019911992199319941995199619971998199920002001200220032004200520062007200820092010201120122013
Commercial carbon dioxide emissions0.8380.9350.9070.981*0.946*1.048*0.8550.980*1.022*1.028*1.078**1.042**1.311**1.134**1.212**1.275**1.228**1.107**1.112**1.172**1.160**1.217**1.201**1.318**1.147**1.265**1.317**1.234**1.300**1.292**1.322**1.284**1.237**1.252**1.319**1.312**1.482***1.349**1.395**1.412**1.393***1.467***1.389***1.286**
(0.548)(0.568)(0.581)(0.532)(0.563)(0.561)(0.521)(0.534)(0.550)(0.529)(0.543)(0.503)(0.582)(0.507)(0.549)(0.569)(0.565)(0.507)(0.516)(0.548)(0.560)(0.539)(0.540)(0.548)(0.561)(0.541)(0.568)(0.536)(0.547)(0.540)(0.537)(0.521)(0.547)(0.530)(0.538)(0.527)(0.519)(0.541)(0.558)(0.555)(0.503)(0.553)(0.534)(0.504)
Electric Power carbon dioxide emissions0.6700.7430.7120.7590.7260.8000.6470.7300.6890.7570.7680.7290.926*0.7820.8400.8690.8570.7530.7560.7870.8350.8110.8050.872*0.7880.8400.8740.8090.8380.8290.8410.8150.8160.7950.8390.8200.927*0.8410.8650.890*0.917*0.925*0.886*0.831*
(0.534)(0.554)(0.566)(0.517)(0.548)(0.545)(0.505)(0.517)(0.531)(0.512)(0.524)(0.485)(0.562)(0.488)(0.529)(0.547)(0.544)(0.487)(0.496)(0.526)(0.541)(0.517)(0.518)(0.525)(0.539)(0.518)(0.545)(0.513)(0.525)(0.515)(0.513)(0.497)(0.526)(0.506)(0.513)(0.502)(0.495)(0.515)(0.531)(0.529)(0.480)(0.528)(0.511)(0.483)
Industrial carbon dioxide emissions0.6130.6990.6820.7360.7130.8080.6630.7620.7760.8010.8540.835*1.078*0.942*0.993*1.055*1.002*0.920*0.924*0.976*1.020*1.011*0.989*1.092**1.053*1.049**1.093**1.025*1.110**1.089**1.125**1.093**1.099**1.076**1.130**1.132**1.271**1.163**1.206**1.237**1.282***1.267**1.183**1.099**
(0.530)(0.551)(0.564)(0.516)(0.549)(0.545)(0.506)(0.519)(0.534)(0.515)(0.529)(0.490)(0.569)(0.496)(0.536)(0.556)(0.554)(0.495)(0.504)(0.536)(0.550)(0.527)(0.527)(0.535)(0.551)(0.529)(0.555)(0.524)(0.538)(0.528)(0.526)(0.510)(0.539)(0.520)(0.527)(0.517)(0.509)(0.530)(0.547)(0.545)(0.495)(0.542)(0.524)(0.494)
Residential carbon dioxide emissions0.7920.8840.8590.936*0.9030.997*0.8150.938*0.979*0.993*1.045*1.012**1.273**1.104**1.178**1.236**1.194**1.075**1.080**1.136**1.130**1.186**1.168**1.278**1.115**1.232**1.279**1.203**1.288**1.259**1.288**1.252**1.226**1.219**1.288**1.280**1.512***1.319**1.361**1.380**1.363***1.438***1.366**1.259**
(0.544)(0.564)(0.577)(0.528)(0.560)(0.557)(0.518)(0.531)(0.547)(0.527)(0.540)(0.501)(0.579)(0.505)(0.547)(0.567)(0.563)(0.504)(0.514)(0.546)(0.558)(0.537)(0.538)(0.546)(0.558)(0.539)(0.566)(0.534)(0.543)(0.538)(0.535)(0.519)(0.546)(0.528)(0.536)(0.526)(0.517)(0.539)(0.556)(0.553)(0.502)(0.551)(0.533)(0.503)
Transportation carbon dioxide emissions0.6570.7260.7010.7540.7200.7890.6480.7370.6980.7690.8060.7720.954*0.828*0.895*0.929*0.8940.7940.8020.8470.8730.878*0.867*0.955*0.8370.908*0.949*0.888*0.918*0.915*0.937*0.909*0.8670.882*0.923*0.913*1.017**0.929*0.979*0.976*0.978**1.024*0.957*0.894*
(0.533)(0.553)(0.565)(0.517)(0.547)(0.544)(0.505)(0.518)(0.531)(0.513)(0.526)(0.487)(0.563)(0.490)(0.531)(0.550)(0.546)(0.489)(0.498)(0.529)(0.543)(0.520)(0.520)(0.529)(0.541)(0.522)(0.548)(0.517)(0.527)(0.519)(0.517)(0.501)(0.528)(0.510)(0.517)(0.506)(0.498)(0.519)(0.536)(0.532)(0.483)(0.531)(0.513)(0.485)
const9.789***9.727***9.458***10.209***9.950***9.699***9.589***10.042***9.258***9.302***9.775***10.163***9.180***9.715***9.635***9.265***10.067***10.384***9.886***9.398***10.489***10.390***9.685***9.188***10.089***9.806***9.245***9.614***10.980***10.514***9.958***10.412***10.506***10.002***10.021***10.325***10.492***10.316***9.570***9.516***9.925***10.104***11.162***11.572***
(0.419)(0.435)(0.445)(0.406)(0.432)(0.429)(0.399)(0.409)(0.419)(0.404)(0.414)(0.384)(0.445)(0.388)(0.421)(0.436)(0.434)(0.388)(0.396)(0.419)(0.429)(0.413)(0.413)(0.420)(0.435)(0.415)(0.435)(0.410)(0.419)(0.414)(0.411)(0.399)(0.419)(0.406)(0.412)(0.405)(0.398)(0.415)(0.428)(0.426)(0.389)(0.424)(0.409)(0.386)
value0.021***0.023***0.021***0.022***0.023***0.025***0.019***0.022***0.023***0.022***0.024***0.024***0.033***0.028***0.028***0.029***0.028***0.025***0.024***0.024***0.027***0.026***0.025***0.027***0.023***0.025***0.025***0.023***0.025***0.024***0.024***0.023***0.024***0.022***0.023***0.023***0.025***0.024***0.025***0.028***0.027***0.028***0.028***0.025***
(0.005)(0.005)(0.005)(0.004)(0.005)(0.005)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.005)(0.005)(0.005)(0.005)(0.005)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.003)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)(0.004)
Observations1,1761,1761,1761,1761,1691,1761,1761,1761,1611,1761,1761,1761,1531,1761,1761,1761,1451,1761,1761,1761,1371,1761,1761,1761,1291,1761,1761,1761,1211,1761,1761,1761,1131,1761,1761,1761,1051,1761,1761,1761,1151,1761,1761,176
R20.0150.0170.0150.0210.0180.0210.0160.0210.0240.0230.0240.0260.0310.0300.0280.0290.0270.0280.0270.0270.0310.0290.0280.0330.0260.0300.0310.0300.0360.0320.0340.0340.0350.0310.0330.0330.0460.0340.0340.0350.0470.0390.0380.037
Adjusted R20.0100.0120.0100.0160.0130.0160.0110.0150.0180.0180.0190.0210.0260.0250.0240.0240.0220.0230.0220.0220.0260.0240.0230.0280.0210.0250.0260.0250.0310.0270.0290.0290.0290.0260.0280.0290.0410.0290.0290.0300.0420.0340.0330.032
Residual Std. Error5.0465.2375.3464.8945.1685.1594.7864.9105.0234.8684.9964.6245.2964.6515.0375.2195.1064.6404.7265.0215.0704.9394.9375.0124.9894.9465.2004.9034.8864.9274.9084.7594.8874.8424.9104.8064.5844.9285.0815.0594.4505.0514.8864.616
F Statistic2.910***3.375***2.984***4.205***3.522***4.240***3.213***4.082***4.632***4.626***4.875***5.245***6.118***5.951***5.714***5.871***5.328***5.535***5.357***5.322***6.019***5.904***5.659***6.564***5.040***6.121***6.152***6.076***6.946***6.380***6.849***6.861***6.631***6.265***6.717***6.749***8.878***6.809***6.780***7.137***9.079***7.901***7.667***7.481***
Note: *p<0.1; **p<0.05; ***p<0.01

Here we can see that the overall CO2 emission values ("value") are statistically significant at the 99 percent confidence level in every year, meaning that, when grouping by year, CO2 emissions are strong predictors of annual average temperature. However, we must note that the R^2 values are very low, meaning the model does not explain most of the variance in average temperature, even though the coefficients are significant. This could indicate that factors other than CO2 emissions are more relevant in predicting temperature over time. The model, however, is still statistically significant and can therefore be used for further analysis. All of the coefficients in this model are positive, meaning that CO2 emissions and yearly temperature are positively correlated, which is not a big surprise in this case. We can also note that since this data was annual instead of monthly, a lot of valuable observations were aggregated away, so the model may have lost some precision.

When breaking down the sector analysis, we can see that Industrial, Residential, and Commercial CO2 emissions are the most statistically significant, and therefore contribute most to the temperature changes over time. The coefficients for these three also trend upward, meaning they contribute more over time to CO2 emissions, and therefore to the changes in temperature. As an interpretation for the year 1970: after controlling for the different types of sector emissions, a 0.021 degree Celsius increase in temperature is associated with an increase of 1 million metric tons in overall CO2 emissions in 1970 across the US states.
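The 1970 interpretation above is just the coefficient times the change in the regressor. As a quick sanity check (the 0.021 coefficient comes from the table; the 10-unit increase is a hypothetical illustration):

```python
beta_value_1970 = 0.021   # coefficient on total CO2 emissions, 1970 column
delta_emissions = 10      # hypothetical increase, million metric tons of CO2

# Predicted change in average temperature, holding the sector variables fixed
predicted_temp_change = beta_value_1970 * delta_emissions
print(f"{predicted_temp_change:.2f} C")  # 0.21 C
```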

So, from these findings we can see that there are other variables that may be able to predict how temperature changes over time. CO2 emissions are slow-acting, and for this fairly small dataset, even though the R^2 is very small, we can still make inferences about how changing these variables can affect our results.

Conclusion:¶

While there are many factors that can contribute to temperature changes in individual states, there are some economic indicators that may help explain why certain states have experienced more dramatic temperature changes over time than others. Here are a few possibilities that we have uncovered through our findings:

Sector-based CO2 activity: Certain sectors, such as the residential and industrial sectors, play a more important role in determining how a state's temperature may behave. Over time, sector-based CO2 activity can become a better determinant for predicting state temperature. Intuitively, this makes sense, since larger industrial activity may point to higher CO2 emissions, and larger population changes may cause larger residential CO2 emissions.

We have also noted that, geographically, northern states exhibit more volatile temperature changes over time regardless of the season, while southern states (the "hotter" states) are less affected by economic variables such as population changes over time, due to their systematically high temperatures.

Over time, policymakers should take these economic findings into account when making decisions that affect state-level output driven by industrialization, and when shaping immigration policy, in order to assess how population changes can affect climate change.